[ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated HIVE-11297:
------------------------------------
    Attachment: HIVE-11297.1.patch

[~csun]: updated patch. In my environment, the [test case "multiple sources, single key"|https://issues.apache.org/jira/browse/HIVE-16780] in spark_dynamic_partition_pruning.q fails, so I could not regenerate spark_dynamic_partition_pruning.q.out. I extracted the "multiple columns, single source" test case into a new qfile, "spark_dynamic_partition_pruning_combine.q". Here I added a configuration property, "hive.spark.dynamic.partition.pruning.combine"; when it is disabled, op trees for partition info are not combined.
{code}
set hive.optimize.ppd=true;
set hive.ppd.remove.duplicatefilters=true;
set hive.spark.dynamic.partition.pruning=true;
set hive.optimize.metadataonly=false;
set hive.optimize.index.filter=true;
set hive.strict.checks.cartesian.product=false;
set hive.spark.dynamic.partition.pruning=true;
set hive.spark.dynamic.partition.pruning.combine=true;

-- SORT_QUERY_RESULTS

create table srcpart_date_hour as
select ds as ds, ds as `date`, hr as hr, hr as hour
from srcpart
group by ds, hr;

-- multiple columns single source
EXPLAIN select count(*) from srcpart
join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;

select count(*) from srcpart
join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;

set hive.spark.dynamic.partition.pruning.combine=false;

EXPLAIN select count(*) from srcpart
join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;

select count(*) from srcpart
join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}
I think we can work in parallel: you can review this patch while I continue to fix HIVE-16780. After HIVE-16780 is fixed in my environment, I can update spark_dynamic_partition_pruning.q.out with the changes from HIVE-11297.

> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>        Attachments: HIVE-11297.1.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates
> partition info for more than one partition column, multiple operator trees
> are created, which all start from the same table scan op but have different
> spark partition pruning sinks.
> As an optimization, we can combine these op trees so that the table does not
> have to be scanned multiple times.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
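The combine optimization the issue describes can be illustrated with a toy sketch. This is not Hive's actual operator model or the patch's implementation; `OpTree` and `combine_op_trees` are hypothetical names used purely to show the idea: branches that each re-scan the same small table (one per partition column, e.g. `ds` and `hr` in the query above) are grouped under a single scan with multiple pruning sinks.

```python
class OpTree:
    """Toy stand-in for one pruning branch: the table it scans plus the
    partition column its spark partition pruning sink feeds."""
    def __init__(self, scan_table, sink_column):
        self.scan_table = scan_table
        self.sink_column = sink_column


def combine_op_trees(trees):
    """Group branches by source table so each table is scanned only once.

    Returns a mapping: table name -> list of partition-pruning sink columns,
    i.e. one combined scan carrying all the sinks for that table.
    """
    combined = {}
    for tree in trees:
        combined.setdefault(tree.scan_table, []).append(tree.sink_column)
    return combined


# Before combining: two branches, each scanning srcpart_date_hour.
trees = [OpTree("srcpart_date_hour", "ds"),
         OpTree("srcpart_date_hour", "hr")]

# After combining: one scan of srcpart_date_hour feeding two sinks.
print(combine_op_trees(trees))  # {'srcpart_date_hour': ['ds', 'hr']}
```

With `hive.spark.dynamic.partition.pruning.combine=false`, the two branches would remain separate and the small table would be scanned once per partition column, which is exactly what the EXPLAIN outputs in the qfile above are meant to contrast.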