[ https://issues.apache.org/jira/browse/HIVE-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596364#comment-15596364 ]

Eugene Koifman edited comment on HIVE-15032 at 10/21/16 9:20 PM:
-----------------------------------------------------------------

Some pointers from [~jcamachorodriguez]
{noformat}
Currently the partition pruning optimization is carried out in CBO, while 
partition column pruner removal (i.e. removal of the parts of a Filter predicate 
that are no longer necessary because partition pruning kicked in) is still 
executed in Hive.

I could not figure out what your question refers to, but here are some pointers:
- The partition column property is carried in ExprNodeColumnDesc. Further, 
Table in org.apache.hadoop.hive.ql.metadata exposes information about the 
partition columns for each Table.
- RelOptHiveTable contains logic to compute the necessary partition list for a 
given predicate. This logic relies on PartitionPruner, which can be found in 
org.apache.hadoop.hive.ql.optimizer.ppr.
- Partition pruning in CBO is carried out by HivePartitionPruneRule, which 
relies on the aforementioned method in RelOptHiveTable.
{noformat}
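
For concreteness, a minimal sketch (not from any patch) of how the first of those pointers surfaces in code. Table.getPartCols(), FieldSchema.getName() and ExprNodeColumnDesc.getIsPartitionColOrVirtualCol() are the stock accessors, but worth double-checking against the target branch:
{noformat}
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;

public class PartColCheck {
  // True if the column reference names one of the table's partition columns.
  static boolean isPartitionColumn(Table tbl, ExprNodeColumnDesc col) {
    // Note: getIsPartitionColOrVirtualCol() is also true for virtual
    // columns (e.g. ROW__ID), so confirm against the table metadata.
    if (!col.getIsPartitionColOrVirtualCol()) {
      return false;
    }
    for (FieldSchema fs : tbl.getPartCols()) {
      if (fs.getName().equalsIgnoreCase(col.getColumn())) {
        return true;
      }
    }
    return false;
  }
}
{noformat}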

{noformat}
It seems to me that it will not be easy to do it directly from 
UpdateDeleteSemanticAnalyzer, as you are translating AST -> AST and thus the 
necessary aux structures to resolve the partitions for the query have not been 
created.

However, given an INSERT query, if Calcite has optimized the plan, you would be 
able to obtain the partitions from SemanticAnalyzer and rewrite to avoid a 
dynamic partition insert. In particular, you can obtain the list of partitions 
for a given query from prunedPartitions in SemanticAnalyzer. Once again, this 
is only valid if Calcite has optimized the plan, as otherwise prunedPartitions 
is not populated till the partition pruner optimization runs in Hive.
{noformat}
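
A sketch of the check the rewrite would hinge on, assuming access to the analyzer's prunedPartitions map after CBO has run. The singleStaticPartSpec helper and the map key handling are hypothetical (the exact key format varies by version); the PrunedPartitionList/Partition accessors are the stock ones:
{noformat}
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.parse.PrunedPartitionList;

public class PrunedPartProbe {
  // Hypothetical helper: returns the partition spec if pruning left
  // exactly one partition and resolved the predicate exactly; otherwise
  // null (meaning CBO didn't run, or DP is genuinely needed).
  static LinkedHashMap<String, String> singleStaticPartSpec(
      Map<String, PrunedPartitionList> prunedPartitions, String key) {
    PrunedPartitionList ppl = prunedPartitions.get(key);
    if (ppl == null || ppl.hasUnknownPartitions()) {
      return null; // not populated (no CBO), or predicate not exact
    }
    if (ppl.getPartitions().size() != 1) {
      return null; // several partitions survive: keep DP
    }
    Partition part = ppl.getPartitions().iterator().next();
    // getSpec() yields the ordered partCol -> value map, e.g. {p=blah}.
    return part.getSpec();
  }
}
{noformat}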

One simple way to fix this is to rewrite the query twice: rewrite Update -> 
Insert and run SemanticAnalyzer.analyze() on it, as is done currently in 
UpdateDeleteSemanticAnalyzer, to create the structures Jesus is referring to, 
and then generate another Insert with static partition info in it (if any was 
found).  This should be well worth it to eliminate DP and "fix" locking along 
the way.
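
To make the shape of that second rewrite concrete, a self-contained sketch using plain strings; the real code would operate on the AST, and the ROW__ID bookkeeping that the actual Update rewrite adds is elided here:
{noformat}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class StaticPartRewrite {
  // Splice the single pruned partition into the partition clause,
  // e.g. "partition(p)" -> "partition(p='blah')". The real rewrite
  // would also drop the now-redundant partition column from the
  // select list; omitted to keep the sketch short.
  static String toStaticPartition(String insertSql, Map<String, String> spec) {
    String staticClause = spec.entrySet().stream()
        .map(e -> e.getKey() + "='" + e.getValue() + "'")
        .collect(Collectors.joining(", ", "partition(", ")"));
    return insertSql.replaceFirst("partition\\([^)]*\\)", staticClause);
  }

  public static void main(String[] args) {
    Map<String, String> spec = new LinkedHashMap<>();
    spec.put("p", "blah"); // the one partition left after pruning
    // pass 1: the dynamic-partition Insert that gets analyzed first
    String dp = "insert into TAB_PART partition(p) "
        + "select a, 7, p from TAB_PART where p = 'blah'";
    // pass 2: same statement regenerated with a static partition spec
    System.out.println(toStaticPartition(dp, spec));
  }
}
{noformat}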

Another option is to look at the results of analyzing the WHERE clause in 
SemanticAnalyzer and then modify the structures used in genFileSink() to supply 
static partition values.  Effectively, this teaches the optimizer to recognize 
that the Insert statement is reading and writing the same partition.  This is 
probably not common outside of Update/Delete/Merge.
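
A minimal sketch of what modifying those structures could look like, assuming it runs inside SemanticAnalyzer right before the file sink is generated. The TableSpec fields and SpecType enum are the stock BaseSemanticAnalyzer ones (worth verifying on the target branch); makeStatic itself is hypothetical:
{noformat}
import java.util.LinkedHashMap;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.TableSpec;

public class DestSpecFixup {
  // Hypothetical helper: flip the destination from dynamic to static
  // when the WHERE clause pinned every partition column to one value.
  static void makeStatic(TableSpec ts, LinkedHashMap<String, String> spec) {
    boolean coversAllPartCols =
        spec.size() == ts.tableHandle.getPartCols().size();
    if (ts.specType == TableSpec.SpecType.DYNAMIC_PARTITION
        && coversAllPartCols) {
      ts.partSpec = spec; // e.g. {p=blah}
      ts.specType = TableSpec.SpecType.STATIC_PARTITION;
    }
  }
}
{noformat}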



> Update/Delete statements use dynamic partitions when it's not necessary
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15032
>                 URL: https://issues.apache.org/jira/browse/HIVE-15032
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 1.0.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> {noformat}
> create table if not exists TAB_PART (a int, b int) partitioned by (p string)
> clustered by (a) into 2 buckets stored as orc
> TBLPROPERTIES ('transactional'='true');
>
> insert into TAB_PART partition(p='blah') values(1,2); -- this uses a static partition
> update TAB_PART set b = 7 where p = 'blah';           -- this uses DP... WHY?
> {noformat}
> The Update is rewritten into an Insert stmt, but SemanticAnalyzer.genFileSink()
> for this Insert is set up with dynamic partitions.
> At least in theory, we should be able to analyze the WHERE clause so that the
> Insert doesn't have to use DP.
> Another important side effect of this is how locks are acquired.  As it is, a
> SHARED_WRITE is acquired on the TAB_PART table.
> However, it would suffice to acquire a SHARED_WRITE on the single partition
> operated on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
