[ 
https://issues.apache.org/jira/browse/HIVE-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858485#comment-15858485
 ] 

Thomas Poepping commented on HIVE-15852:
----------------------------------------

[~ashutoshc] Ashutosh, sorry it took so long to open this Jira issue. Here's a 
summary of what I've found so far. Reverting HIVE-13040 would be the easiest 
solution, but I really don't want to do that. I think the performance gains 
can be large, especially in the blobstore (s3a or azure) case, as empty file 
creation is far from free.

Happy to hear suggestions and start a conversation.

> Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15852
>                 URL: https://issues.apache.org/jira/browse/HIVE-15852
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.1
>            Reporter: Thomas Poepping
>
> Due to HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ), which 
> doesn't create empty files to represent empty buckets when Hive is on Tez, a 
> couple of things are broken.
> First of all, if there are empty buckets (which is possible with large 
> datasets in the partitioned-bucketed case), tablesampling will not work if 
> you're referencing a bucket number higher than the number of files.
> For example, in some partition 'p' there are three rows, and the table 't' 
> is clustered into ten buckets. With maximal hashing, only three bucket files 
> will be created. If we run select * from t tablesample (bucket x out of 10) 
> where <selecting from p> with x > 3, an ArrayIndexOutOfBoundsException is 
> thrown because Hive assumes there are only three buckets.
> Second, other applications (such as Pig) may be making assumptions about the 
> number of files equaling the number of buckets.
> Possible fixes:
> * Revert HIVE-13040
> * Change how tablesampling is implemented to accept the possibility that the 
> number of files != number of buckets
> ** Would require coordination across projects to change assumptions
> Things to consider:
> * what performance gains are there from not creating empty files?
> * if the gains are large, are we willing to lose them? (by reverting 
> HIVE-13040)
> * _how else can we avoid creating unnecessary files, while still maintaining 
> invariants other applications expect?_
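
To make the failure mode concrete, here is a minimal sketch in Python. It is 
not Hive's actual code. The helper names are hypothetical, and it assumes 
bucket files are named with a zero-padded bucket id followed by an attempt 
suffix (e.g. 000004_0). The positional lookup mirrors the "number of files == 
number of buckets" assumption; the second lookup shows one way a reader could 
tolerate missing files by treating an absent bucket file as an empty bucket.

```python
import re

# Assumed naming convention: "<zero-padded bucket id>_<attempt>", e.g. "000004_0".
BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def file_by_position(files, bucket):
    """Positional lookup: assumes exactly one file per bucket, in order.

    `bucket` is 1-based, as in "tablesample (bucket x out of n)".
    Raises IndexError when fewer files exist than the requested bucket,
    which is the Python analogue of the reported exception.
    """
    return sorted(files)[bucket - 1]


def file_by_bucket_id(files, bucket):
    """Lookup by the bucket id parsed from the file name.

    Returns None when no file exists for the bucket, so the caller can
    treat the bucket as empty instead of failing.
    """
    wanted = bucket - 1  # file names are assumed to use 0-based bucket ids
    for name in files:
        match = BUCKET_FILE.match(name)
        if match and int(match.group(1)) == wanted:
            return name
    return None


# Three rows landed in buckets 2, 5 and 8 of a ten-bucket table.
files = ["000001_0", "000004_0", "000007_0"]
```

With these files, file_by_position(files, 5) fails because only three files 
exist, while file_by_bucket_id(files, 5) finds 000004_0 and 
file_by_bucket_id(files, 4) returns None for the empty bucket. This only 
addresses the Hive side. Other applications that count files would still need 
to change their assumptions.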



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
