[ 
https://issues.apache.org/jira/browse/HIVE-11525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906134#comment-14906134
 ] 

Elliot West commented on HIVE-11525:
------------------------------------

I think this may cause issues for data inserted into transactional tables using 
the Hive [HCatalog Streaming 
API|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]. 
These records have a bucket ID assigned randomly, so I expect that in this 
case there is no reliable relationship between the bucket column value of an 
inserted row and the bucket the row is actually written to. Therefore, for 
such tables it would still be necessary to read all buckets to be certain that 
the {{WHERE}} condition specified in the example is applied correctly.
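
To illustrate the concern, here is a minimal sketch in plain Java. It is not 
Hive code: the class, the method names, the bucket count and the 
hash-modulo bucketing function are all hypothetical stand-ins chosen for the 
example. It writes the same rows once by hashing the bucket column and once 
into randomly chosen buckets, then compares a pruned read (only the bucket the 
predicate value hashes to) with a full scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class BucketPruningSketch {
    // Assumed bucket count, chosen only for this illustration.
    public static final int NUM_BUCKETS = 4;

    // Hypothetical bucketing function (non-negative hash modulo bucket count).
    // A simplified stand-in, not a copy of Hive's implementation.
    public static int hashBucket(int colA) {
        return (Integer.hashCode(colA) & Integer.MAX_VALUE) % NUM_BUCKETS;
    }

    private static List<List<Integer>> emptyBuckets() {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
        return buckets;
    }

    // Writer that places each row in the bucket its column value hashes to.
    public static List<List<Integer>> writeHashed(int[] rows) {
        List<List<Integer>> buckets = emptyBuckets();
        for (int colA : rows) {
            buckets.get(hashBucket(colA)).add(colA);
        }
        return buckets;
    }

    // Writer that ignores the column value and picks a bucket at random,
    // mimicking the behaviour described in the comment above.
    public static List<List<Integer>> writeRandom(int[] rows, Random rng) {
        List<List<Integer>> buckets = emptyBuckets();
        for (int colA : rows) {
            buckets.get(rng.nextInt(NUM_BUCKETS)).add(colA);
        }
        return buckets;
    }

    private static int count(List<Integer> bucket, int value) {
        int matches = 0;
        for (int v : bucket) {
            if (v == value) {
                matches++;
            }
        }
        return matches;
    }

    // "WHERE colA = value" with pruning: read only the hashed bucket.
    public static int countPruned(List<List<Integer>> buckets, int value) {
        return count(buckets.get(hashBucket(value)), value);
    }

    // "WHERE colA = value" without pruning: read every bucket.
    public static int countFullScan(List<List<Integer>> buckets, int value) {
        int matches = 0;
        for (List<Integer> bucket : buckets) {
            matches += count(bucket, value);
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] rows = new int[1000];
        Arrays.fill(rows, 123123);

        List<List<Integer>> hashed = writeHashed(rows);
        List<List<Integer>> random = writeRandom(rows, new Random(42));

        System.out.println("hashed, pruned:    " + countPruned(hashed, 123123));
        System.out.println("hashed, full scan: " + countFullScan(hashed, 123123));
        System.out.println("random, pruned:    " + countPruned(random, 123123));
        System.out.println("random, full scan: " + countFullScan(random, 123123));
    }
}
```

With the hashed writer the pruned read and the full scan agree. With the 
random writer the pruned read can only see the rows that happened to land in 
the expected bucket, so it undercounts unless every bucket is read.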

The code in question is located here:
https://github.com/apache/hive/blob/master/hcatalog/streaming/src/java/org/apache/hive/hcatalog/streaming/AbstractRecordWriter.java#L119

Note that I think the issue I describe needs to be fixed in the HCatalog 
Streaming API and is not a problem with the feature suggested here. However, I 
wanted to call out that, as it stands, this feature might introduce some 
unintended side effects if the random bucket IDs are not addressed.

> Bucket pruning
> --------------
>
>                 Key: HIVE-11525
>                 URL: https://issues.apache.org/jira/browse/HIVE-11525
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
>            Reporter: Maciek Kocon
>            Assignee: Takuya Fukudome
>              Labels: gsoc2015
>
> Logically and functionally, bucketing and partitioning are quite similar: 
> both provide a mechanism to segregate and separate the table's data based on 
> its content. Thanks to that, significant further optimisations such as 
> [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design, where PARTITIONing is 
> open/explicit while BUCKETing is discrete/implicit.
> Partitioning seems to be very common, if not a standard feature, in all 
> current RDBMS, while BUCKETING seems to be specific to Hive.
> In a way, BUCKETING could also be called "hashing" or simply "IMPLICIT 
> PARTITIONING".
> Although these two are recognised as separate features available in Hive, 
> there should be nothing to prevent leveraging the same existing query/join 
> optimisations across the two.
> BUCKET pruning
> Enable an optimisation equivalent to partition PRUNING for queries on 
> BUCKETED tables.
> The simplest example is for queries like:
> "SELECT … FROM x WHERE colA=123123"
> to read only the relevant bucket file rather than all bucket files that 
> belong to the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
