[ https://issues.apache.org/jira/browse/HIVE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mostafa Mokhtar updated HIVE-8671: ---------------------------------- Description: Overflow in row counts and data size for several TPC-DS queries. Interestingly the operators which have overflow end up running with a small parallelism. For instance Reducer 2 has an overflow but it only runs with parallelism of 2. {code} Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized {code} {code} VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS Map 1 62 26.41 1,779,510 211,978,502 60,628,390 Map 5 1 4.28 6,950 138,098 138,098 Map 6 1 2.44 3,910 31 31 Reducer 2 2 22.69 61,320 60,628,390 69,182 Reducer 3 1 2.63 3,910 69,182 100 Reducer 4 1 1.01 1,180 100 100 {code} Query {code} explain select i_item_desc ,i_category ,i_class ,i_current_price ,i_item_id ,sum(ws_ext_sales_price) as itemrevenue ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over (partition by i_class) as revenueratio from web_sales ,item ,date_dim where web_sales.ws_item_sk = item.i_item_sk and item.i_category in ('Jewelry', 'Sports', 'Books') and web_sales.ws_sold_date_sk = date_dim.d_date_sk and date_dim.d_date between '2001-01-12' and '2001-02-11' group by i_item_id ,i_item_desc ,i_category ,i_class ,i_current_price order by i_category ,i_class ,i_item_id ,i_item_desc ,revenueratio limit 100 {code} Explain {code} STAGE PLANS: Stage: Stage-1 Tez Edges: Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 3 (SIMPLE_EDGE) DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 Vertices: Map 1 Map Operator Tree: TableScan alias: web_sales filterExpr: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 2850189889652 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ws_item_sk (type: int), ws_ext_sales_price (type: float), ws_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col0} {_col1} 1 keys: 0 _col2 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col1 input vertices: 1 Map 6 Statistics: Num rows: 24145061366 Data size: 193160490928 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col1} 1 {_col1} {_col2} {_col3} {_col4} {_col5} keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col1, _col6, _col7, _col8, _col9, _col10 input vertices: 1 Map 5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col6 (type: string), _col7 (type: string), _col10 (type: string), _col9 (type: string), _col8 (type: float), _col1 (type: float) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: sum(_col5) keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) sort order: +++++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col5 (type: double) Execution mode: vectorized Map 5 Map Operator Tree: TableScan alias: item filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float), i_class (type: string), i_category (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float), _col4 (type: string), _col5 (type: string) Execution mode: vectorized Map 6 Map Operator Tree: TableScan alias: date_dim filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: d_date_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Group By Operator keys: _col0 (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Dynamic Partitioning Event Operator Target Input: web_sales Partition key expr: ws_sold_date_sk Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Target column: ws_sold_date_sk Target Vertex: Map 1 Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized Reducer 3 Reduce Operator Tree: Extract Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE PTF Operator Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: double), ((_col5 * 100.0) / _wcol0) (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col1 (type: string), _col2 (type: string), _col4 (type: string), _col0 (type: string), _col6 (type: double) sort order: +++++ Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE TopN Hash Memory Usage: 0.04 value expressions: _col3 (type: float), _col5 (type: double) Reducer 4 Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 (type: double), KEY.reducesinkkey4 (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Limit Number of rows: 100 Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Stage: Stage-0 Fetch Operator limit: 100 Processor Tree: ListSink {code} was: Overflow in row counts and data size for several TPC-DS queries. Interestingly the operators which have overflow end up running with a small parallelism. For instance Reducer 2 has an overflow but it only runs with parallelism of 2. {code} Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 9223372036854775807 Data size: 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized {code} {code} VERTEX TOTAL_TASKS FAILED_TASKS KILLED_TASKS DURATION_SECONDS CPU_TIME_MILLIS GC_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS Map 1 62 0 0 26.41 1,779,510 22,242 211,978,502 60,628,390 Map 5 1 0 0 4.28 6,950 85 138,098 138,098 Map 6 1 0 0 2.44 3,910 28 31 31 Reducer 2 2 0 0 22.69 61,320 1,724 60,628,390 69,182 Reducer 3 1 0 0 2.63 3,910 19 69,182 100 Reducer 4 1 0 0 1.01 1,180 33 100 100 {code} Query {code} explain select i_item_desc ,i_category ,i_class ,i_current_price ,i_item_id ,sum(ws_ext_sales_price) as itemrevenue ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over (partition by i_class) as revenueratio from web_sales ,item ,date_dim where web_sales.ws_item_sk = item.i_item_sk and item.i_category in ('Jewelry', 'Sports', 'Books') and web_sales.ws_sold_date_sk = date_dim.d_date_sk and date_dim.d_date between '2001-01-12' and '2001-02-11' group by i_item_id ,i_item_desc ,i_category ,i_class ,i_current_price order by i_category ,i_class ,i_item_id ,i_item_desc ,revenueratio limit 100 {code} Explain {code} STAGE PLANS: Stage: Stage-1 Tez Edges: Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) Reducer 4 <- Reducer 3 (SIMPLE_EDGE) DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 Vertices: Map 1 Map Operator Tree: TableScan alias: web_sales filterExpr: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 2850189889652 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ws_item_sk is not null (type: boolean) Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ws_item_sk (type: int), ws_ext_sales_price (type: float), ws_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 21594638446 Data size: 172746300152 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col0} {_col1} 1 keys: 0 _col2 (type: int) 1 _col0 (type: int) outputColumnNames: _col0, _col1 input vertices: 1 Map 6 Statistics: Num rows: 24145061366 Data size: 193160490928 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {_col1} 1 {_col1} {_col2} {_col3} {_col4} {_col5} keys: 0 _col0 (type: int) 1 _col0 (type: int) outputColumnNames: _col1, _col6, _col7, _col8, _col9, _col10 input vertices: 1 Map 5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col6 (type: string), _col7 (type: string), _col10 (type: string), _col9 (type: string), _col8 (type: float), _col1 (type: float) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 25381041158 Data size: 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: sum(_col5) keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) mode: hash outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) sort order: +++++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float) Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col5 (type: double) Execution mode: vectorized Map 5 Map Operator Tree: TableScan alias: item filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 462000 Data size: 663862160 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((i_category) IN ('Jewelry', 'Sports', 'Books') and i_item_sk is not null) (type: boolean) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: i_item_sk (type: int), i_item_id (type: string), i_item_desc (type: string), i_current_price (type: float), i_class (type: string), i_category (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 231000 Data size: 109491664 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: float), _col4 (type: string), _col5 (type: string) Execution mode: vectorized Map 6 Map Operator Tree: TableScan alias: date_dim filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' and d_date_sk is not null) (type: boolean) Statistics: Num rows: 36524 Data size: 3579352 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: d_date_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 36524 Data size: 146096 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Group By Operator keys: _col0 (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Dynamic Partitioning Event Operator Target Input: web_sales Partition key expr: ws_sold_date_sk Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Target column: ws_sold_date_sk Target Vertex: Map 1 Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) mode: mergepartial outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col3 (type: string), _col3 (type: string) sort order: ++ Map-reduce partition columns: _col3 (type: string) Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col5 (type: double) Execution mode: vectorized Reducer 3 Reduce Operator Tree: Extract Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE PTF Operator Statistics: Num rows: 119291 Data size: 1908656 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: double), ((_col5 * 100.0) / _wcol0) (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col1 (type: string), _col2 (type: string), _col4 (type: string), _col0 (type: string), _col6 (type: double) sort order: +++++ Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE TopN Hash Memory Usage: 0.04 value expressions: _col3 (type: float), _col5 (type: double) Reducer 4 Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 (type: double), KEY.reducesinkkey4 (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6 Statistics: Num rows: 119291 Data size: 954328 Basic stats: COMPLETE Column stats: COMPLETE Limit Number of rows: 100 Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 100 Data size: 800 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Stage: Stage-0 Fetch Operator limit: 100 Processor Tree: ListSink {code} > Overflow in estimate row count and data size with fetch column stats > -------------------------------------------------------------------- > > Key: HIVE-8671 > URL: https://issues.apache.org/jira/browse/HIVE-8671 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Critical > Fix For: 0.14.0 > > > Overflow in row counts and data size for several TPC-DS queries. > Interestingly the operators which have overflow end up running with a small > parallelism. > For instance Reducer 2 has an overflow but it only runs with parallelism of 2. > {code} > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), > KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 9223372036854775807 Data size: > 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 9223372036854775807 Data size: > 9223372036854775341 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float), > _col5 (type: double) > Execution mode: vectorized > {code} > {code} > VERTEX TOTAL_TASKS DURATION_SECONDS CPU_TIME_MILLIS > INPUT_RECORDS OUTPUT_RECORDS > Map 1 62 26.41 1,779,510 > 211,978,502 60,628,390 > Map 5 1 4.28 6,950 > 138,098 138,098 > Map 6 1 2.44 3,910 > 31 31 > Reducer 2 2 22.69 61,320 > 60,628,390 69,182 > Reducer 3 1 2.63 3,910 > 69,182 100 > Reducer 4 1 1.01 1,180 > 100 100 > {code} > Query > {code} > explain > select i_item_desc > ,i_category > ,i_class > ,i_current_price > ,i_item_id > ,sum(ws_ext_sales_price) as itemrevenue > ,sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over > (partition by i_class) as revenueratio > from > web_sales > ,item > ,date_dim > where > web_sales.ws_item_sk = item.i_item_sk > and item.i_category in ('Jewelry', 'Sports', 'Books') > and web_sales.ws_sold_date_sk = date_dim.d_date_sk > and date_dim.d_date between '2001-01-12' and '2001-02-11' > group by > i_item_id > ,i_item_desc > ,i_category > ,i_class > ,i_current_price > order by > i_category > ,i_class > ,i_item_id > ,i_item_desc > ,revenueratio > limit 100 > {code} > Explain > {code} > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 1 <- Map 5 (BROADCAST_EDGE), Map 6 (BROADCAST_EDGE) > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > Reducer 4 <- Reducer 3 (SIMPLE_EDGE) > DagName: mmokhtar_20141019164343_854cb757-01bd-40cb-843e-9ada7c5e6f38:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: web_sales > filterExpr: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 2850189889652 > Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ws_item_sk is not null (type: boolean) > Statistics: Num rows: 21594638446 Data size: 172746300152 > Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: ws_item_sk (type: int), ws_ext_sales_price > (type: float), ws_sold_date_sk (type: int) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 21594638446 Data size: > 172746300152 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col0} {_col1} > 1 > keys: > 0 _col2 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col0, _col1 > input vertices: > 1 Map 6 > Statistics: Num rows: 24145061366 Data size: > 193160490928 Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col1} > 1 {_col1} {_col2} {_col3} {_col4} {_col5} > keys: > 0 _col0 (type: int) > 1 _col0 (type: int) > outputColumnNames: _col1, _col6, _col7, _col8, > _col9, _col10 > input vertices: > 1 Map 5 > Statistics: Num rows: 25381041158 Data size: > 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col6 (type: string), _col7 (type: > string), _col10 (type: string), _col9 (type: string), _col8 (type: float), > _col1 (type: float) > outputColumnNames: _col0, _col1, _col2, _col3, > _col4, _col5 > Statistics: Num rows: 25381041158 Data size: > 11929089344260 Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: sum(_col5) > keys: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3, > _col4, _col5 > Statistics: Num rows: 119291 Data size: 954328 > Basic stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: > float) > sort order: +++++ > Map-reduce partition columns: _col0 (type: > string), _col1 (type: string), _col2 (type: string), _col3 (type: string), > _col4 (type: float) > Statistics: Num rows: 119291 Data size: > 954328 Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col5 (type: double) > Execution mode: vectorized > Map 5 > Map Operator Tree: > TableScan > alias: item > filterExpr: ((i_category) IN ('Jewelry', 'Sports', 'Books') > and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 462000 Data size: 663862160 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((i_category) IN ('Jewelry', 'Sports', > 'Books') and i_item_sk is not null) (type: boolean) > Statistics: Num rows: 231000 Data size: 109491664 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: i_item_sk (type: int), i_item_id (type: > string), i_item_desc (type: string), i_current_price (type: float), i_class > (type: string), i_category (type: string) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, > _col5 > Statistics: Num rows: 231000 Data size: 109491664 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 231000 Data size: 109491664 > Basic stats: COMPLETE Column stats: COMPLETE > value expressions: _col1 (type: string), _col2 (type: > string), _col3 (type: float), _col4 (type: string), _col5 (type: string) > Execution mode: vectorized > Map 6 > Map Operator Tree: > TableScan > alias: date_dim > filterExpr: (d_date BETWEEN '2001-01-12' AND '2001-02-11' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date BETWEEN '2001-01-12' AND '2001-02-11' > and d_date_sk is not null) (type: boolean) > Statistics: Num rows: 36524 Data size: 3579352 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 36524 Data size: 146096 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col0 (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 36524 Data size: 0 Basic > stats: PARTIAL Column stats: COMPLETE > Dynamic Partitioning Event Operator > Target Input: web_sales > Partition key expr: ws_sold_date_sk > Statistics: Num rows: 36524 Data size: 0 Basic > stats: PARTIAL Column stats: COMPLETE > Target column: ws_sold_date_sk > Target Vertex: Map 1 > Execution mode: vectorized > Reducer 2 > Reduce Operator Tree: > Group By Operator > aggregations: sum(VALUE._col0) > keys: KEY._col0 (type: string), KEY._col1 (type: string), > KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: float) > mode: mergepartial > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col3 (type: string), _col3 (type: string) > sort order: ++ > Map-reduce partition columns: _col3 (type: string) > Statistics: Num rows: 119291 Data size: 1908656 Basic > stats: COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: string), _col1 (type: > string), _col2 (type: string), _col3 (type: string), _col4 (type: float), > _col5 (type: double) > Execution mode: vectorized > Reducer 3 > Reduce Operator Tree: > Extract > Statistics: Num rows: 119291 Data size: 1908656 Basic stats: > COMPLETE Column stats: COMPLETE > PTF Operator > Statistics: Num rows: 119291 Data size: 1908656 Basic > stats: COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col1 (type: string), _col2 (type: string), > _col3 (type: string), _col4 (type: float), _col0 (type: string), _col5 (type: > double), ((_col5 * 100.0) / _wcol0) (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, > _col5, _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col1 (type: string), _col2 (type: > string), _col4 (type: string), _col0 (type: string), _col6 (type: double) > sort order: +++++ > Statistics: Num rows: 119291 Data size: 954328 Basic > stats: COMPLETE Column stats: COMPLETE > TopN Hash Memory Usage: 0.04 > value expressions: _col3 (type: float), _col5 (type: > double) > Reducer 4 > Reduce Operator Tree: > Select Operator > expressions: KEY.reducesinkkey3 (type: string), > KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string), > VALUE._col0 (type: float), KEY.reducesinkkey2 (type: string), VALUE._col1 > (type: double), KEY.reducesinkkey4 (type: double) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, > _col6 > Statistics: Num rows: 119291 Data size: 954328 Basic stats: > COMPLETE Column stats: COMPLETE > Limit > Number of rows: 100 > Statistics: Num rows: 100 Data size: 800 Basic stats: > COMPLETE Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 100 Data size: 800 Basic stats: > COMPLETE Column stats: COMPLETE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Execution mode: vectorized > Stage: Stage-0 > Fetch Operator > limit: 100 > Processor Tree: > ListSink > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)