[jira] [Commented] (HIVE-15852) Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException

Thomas Poepping (JIRA) Mon, 13 Feb 2017 16:17:31 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864742#comment-15864742
 ]


Thomas Poepping commented on HIVE-15852:
----------------------------------------

[~ashutoshc] Yeah, I think that solution makes the most sense. Even if other 
projects have hidden dependencies on this behavior, as you say, there is no 
interface contract, so those assumptions should not be made in the first place. 

I will look into how tablesampling can be improved. Hopefully the fix will not 
be too wide-reaching.

> Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15852
>                 URL: https://issues.apache.org/jira/browse/HIVE-15852
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.1
>            Reporter: Thomas Poepping
>
> Due to HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ), which 
> stops Hive on Tez from creating empty files to represent empty buckets, a 
> couple of things are broken.
> First of all, if there are empty buckets (which is possible with large 
> datasets in the partitioned-bucketed case), tablesampling will not work if 
> you're referencing a bucket number higher than the number of files.
> For example, suppose partition 'p' of table 't' holds three rows, and 't' is 
> clustered into ten buckets. Even if every row hashes to a different bucket, 
> at most three bucket files will be created. If we then run 
> select * from t tablesample (bucket x out of 10) where <selecting from p> 
> with x > 3, an ArrayIndexOutOfBoundsException is thrown because Hive assumes 
> there are only three buckets.
> Second, other applications (such as Pig) may be making assumptions about the 
> number of files equaling the number of buckets.
> Possible fixes:
> * Revert HIVE-13040
> * Change how tablesampling is implemented to accept the possibility that the 
> number of files != number of buckets
> ** Would require coordination across projects to change assumptions
> Things to consider:
> * what performance gains are there from not creating empty files?
> * if the gains are large, are we willing to lose them? (by reverting 
> HIVE-13040)
> * _how else can we avoid creating unnecessary files, while still maintaining 
> invariants other applications expect?_
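
The failure mode in the issue description above can be illustrated with a minimal Java sketch. This is not Hive's actual code: the class `BucketLookupSketch`, the methods `fileByPosition` and `fileByBucketId`, and the assumption that each bucket file name starts with a zero-padded bucket id followed by `_` are hypothetical, chosen only for illustration. It contrasts indexing the file list by bucket number (which fails once files < buckets) with resolving the bucket id from each file name and treating a missing file as an empty bucket.

```java
import java.util.Optional;

final class BucketLookupSketch {

    // Naive lookup: assumes the file at position (bucket - 1) holds that bucket.
    // With fewer files than buckets this throws ArrayIndexOutOfBoundsException,
    // or silently returns the wrong file when the index happens to be in range.
    static String fileByPosition(String[] files, int bucket) {
        return files[bucket - 1];
    }

    // Tolerant lookup: reads the bucket id from the file name (assumed format
    // "<zero-padded id>_<suffix>") and reports a missing file as an empty bucket.
    static Optional<String> fileByBucketId(String[] files, int bucket) {
        for (String file : files) {
            int id = Integer.parseInt(file.substring(0, file.indexOf('_')));
            if (id == bucket - 1) {
                return Optional.of(file);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three rows landed in three of ten buckets, so only three files exist.
        String[] files = {"000001_0", "000004_0", "000008_0"};

        System.out.println("bucket 5 by id: " + fileByBucketId(files, 5));
        System.out.println("bucket 7 by id: " + fileByBucketId(files, 7));

        try {
            fileByPosition(files, 7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("bucket 7 by position failed: " + e.getMessage());
        }
    }
}
```

Under these assumptions, sampling a bucket with no file simply yields no rows instead of failing, which is the behavior the second proposed fix is aiming for.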



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
