[
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated HIVE-5945:
---------------------------
Description:
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask (hive.auto.convert.join.noconditionaltask=false), so
I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining
store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task that determines the plan of the join involving
item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap
contains all input tables used in this query and the intermediate table
generated by joining store_sales and date_dim. So, when we sum the sizes of all
small tables, the size of store_sales (which is around 45GB in my test) is
also counted.
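To make the effect concrete, here is a minimal sketch of the size check in Python. It is not Hive's code: the function, the "intermediate" alias, the threshold, and every size except the 45GB store_sales figure are assumptions for illustration only.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Sizes keyed by alias, mimicking aliasToFileSizeMap. Only the 45GB figure
# comes from the report; every other size is an assumed placeholder.
alias_to_file_size = {
    "store_sales": 45 * GB,
    "date_dim": 10 * MB,
    "item": 20 * MB,
    "customer_demographics": 80 * MB,
    "store": 1 * MB,
    "intermediate": 2 * GB,  # hypothetical alias for store_sales JOIN date_dim
}

SMALL_TABLE_THRESHOLD = 25_000_000  # assumed threshold in bytes


def sum_small_tables(sizes, big_alias, participants=None):
    """Sum the sizes of all aliases except big_alias.

    When participants is given, only aliases that actually feed this join
    are counted; when it is None, every alias in the map is counted.
    """
    return sum(
        size
        for alias, size in sizes.items()
        if alias != big_alias and (participants is None or alias in participants)
    )


# Join involving item: its inputs are the intermediate result and item.
join_inputs = {"intermediate", "item"}

unscoped = sum_small_tables(alias_to_file_size, "intermediate")
scoped = sum_small_tables(alias_to_file_size, "intermediate", join_inputs)

print(unscoped > SMALL_TABLE_THRESHOLD)  # store_sales alone exceeds the threshold
print(scoped <= SMALL_TABLE_THRESHOLD)   # only item is counted
```

With the unscoped sum, store_sales pushes the total far past any reasonable small-table limit, so the map join is rejected even though the only small table in this join is item. Restricting the sum to the aliases that feed the join is one possible way to avoid that.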
was:
Here is an example
{code}
select
i_item_id,
s_state,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
cd_gender = 'F' and
cd_marital_status = 'U' and
cd_education_status = 'Primary' and
d_year = 2002 and
s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
i_item_id,
s_state
order by
i_item_id,
s_state
limit 100;
{code}
I turned off noconditionaltask (hive.auto.convert.join.noconditionaltask=false), so
I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining
store_sales and date_dim) and 3 MR jobs (for reduce joins).
So, I checked the conditional task determining the plan of the join involving
item. In
> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables'
> sizes including those tables which are not used in the child of this
> conditional task.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.13.0
> Reporter: Yin Huai
>
> Here is an example
> {code}
> select
> i_item_id,
> s_state,
> avg(ss_quantity) agg1,
> avg(ss_list_price) agg2,
> avg(ss_coupon_amt) agg3,
> avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk =
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
> cd_gender = 'F' and
> cd_marital_status = 'U' and
> cd_education_status = 'Primary' and
> d_year = 2002 and
> s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
> i_item_id,
> s_state
> order by
> i_item_id,
> s_state
> limit 100;
> {code}
> I turned off noconditionaltask (hive.auto.convert.join.noconditionaltask=false), so
> I expected 4 Map-only jobs for this query. However, I got 1 Map-only job (joining
> store_sales and date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task that determines the plan of the join involving
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, aliasToFileSizeMap
> contains all input tables used in this query and the intermediate table
> generated by joining store_sales and date_dim. So, when we sum the sizes of
> all small tables, the size of store_sales (which is around 45GB in my test)
> is also counted.