transform with distribute by bug?

Wil - Thu, 02 Oct 2014 12:39:26 -0700

Hi,

I am having an issue with a query that uses a transform with a distribute 
by.  The query works in Hive 0.11 but not in Hive 0.13 (both on EMR).


shell> echo -e "a\tb\tc" > /tmp/a.txt

create table tbl (
  a string,
  b string,
  c string
)
row format delimited
  fields terminated by '\t'
;

load data local inpath '/tmp/a.txt' overwrite into table tbl;

select transform(a, b)
  using 'python foo.py' as (y, z)
from (
  select a, b, c
  from tbl
  distribute by c
) tmp
;
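
The contents of foo.py are not part of this report. As a minimal stand-in for reproducing the query, assuming the script only needs to pass the two input columns through, something like the sketch below should do. Hive streams the transform's input columns to the script as tab-separated lines on stdin and, by default, reads tab-separated lines back from stdout. The function name `transform` and the pass-through behavior are assumptions for illustration, not the original script.

```python
#!/usr/bin/env python
# Hypothetical stand-in for foo.py; the real script is not shown in this report.
import sys


def transform(lines):
    """Yield one tab-separated (y, z) row for each input row of (a, b)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Skip rows that do not carry both input columns.
            continue
        a, b = fields[0], fields[1]
        # Assumption: pass the two columns through unchanged.
        yield "\t".join((a, b))


if __name__ == "__main__":
    for row in transform(sys.stdin):
        print(row)
```

Running `echo -e "a\tb" | python foo.py` locally is a quick way to check the script before using it in the query.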


Error:
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: cannot find field _col2 from [0:_col0, 1:_col1]
> Caused by: java.lang.RuntimeException: cannot find field _col2 from [0:_col0, 1:_col1]


However, it works if I add a sort by after the distribute by:

select transform(a, b)
  using 'python foo.py' as (y, z)
from (
  select a, b, c
  from tbl
  distribute by c sort by c
) tmp
;



Is this valid behavior or a bug?


Hive 0.13 plan (non-working query):

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tbl
            Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: a (type: string), b (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              Reduce Output Operator
                sort order:
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
            Transform Operator
              command: python foo.py
              output info:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1


Hive 0.13 plan (working query):

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tbl
            Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: a (type: string), b (type: string), c (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              Reduce Output Operator
                key expressions: _col2 (type: string)
                sort order: +
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
            Transform Operator
              command: python foo.py
              output info:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 0 Data size: 6 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
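
The difference between the two Hive 0.13 plans above can be checked mechanically. In the non-working plan, the map-side Select Operator outputs only _col0 and _col1, while the Reduce Output Operator partitions on _col2, which matches the "cannot find field _col2 from [0:_col0, 1:_col1]" error. The sketch below is only an illustration of that comparison, with column lists copied by hand from the plans; `missing_columns` is a hypothetical helper, not part of Hive.

```python
def missing_columns(available, referenced):
    """Return the referenced column names that the upstream operator does not output."""
    return [col for col in referenced if col not in available]


# Column lists copied by hand from the two Hive 0.13 plans.
non_working_select = ["_col0", "_col1"]       # outputColumnNames, distribute by only
working_select = ["_col0", "_col1", "_col2"]  # outputColumnNames, distribute by + sort by
partition_columns = ["_col2"]                 # Map-reduce partition columns in both plans

print(missing_columns(non_working_select, partition_columns))  # ['_col2']
print(missing_columns(working_select, partition_columns))      # []
```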


Hive 0.11 plan:

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME tbl))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_TABLE_OR_COL c))) (TOK_DISTRIBUTEBY (TOK_TABLE_OR_COL c)))) tmp)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TRANSFORM (TOK_EXPLIST (TOK_TABLE_OR_COL a) (TOK_TABLE_OR_COL b)) TOK_SERDE TOK_RECORDWRITER 'python foo.py' TOK_SERDE TOK_RECORDREADER (TOK_ALIASLIST y z))))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        tmp:tbl
          TableScan
            alias: tbl
            Select Operator
              expressions:
                    expr: a
                    type: string
                    expr: b
                    type: string
                    expr: c
                    type: string
              outputColumnNames: _col0, _col1, _col2
              Reduce Output Operator
                sort order:
                Map-reduce partition columns:
                      expr: _col2
                      type: string
                tag: -1
                value expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: string
                      expr: _col2
                      type: string
      Reduce Operator Tree:
        Extract
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1
            Transform Operator
              command: python foo.py
              output info:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
