[ https://issues.apache.org/jira/browse/HIVE-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858485#comment-15858485 ]
Thomas Poepping commented on HIVE-15852:
----------------------------------------

[~ashutoshc] Ashutosh, sorry it took so long to open this Jira issue. Here's a summary of what I've found so far.

While it's the easiest solution, I really don't want to revert HIVE-13040. I think the performance gains can be large, especially in the blobstore (s3a or azure) case, as empty file creation is far from free.

Happy to hear suggestions, and start a conversation.

> Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15852
>                 URL: https://issues.apache.org/jira/browse/HIVE-15852
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.1
>            Reporter: Thomas Poepping
>
> Due to HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ), which
> doesn't create empty files to represent empty buckets when Hive is on Tez, a
> couple of things are broken.
> First of all, if there are empty buckets (which is possible with large
> datasets in the partitioned-bucketed case), tablesampling will not work if
> you're referencing a bucket number higher than the number of files.
> e.g. In some partition 'p', there are three rows. The table 't' is clustered
> into ten buckets. Even if every row hashes to a different bucket, only three
> bucket files will be created. If we do select * from t tablesample (bucket x
> out of 10) where <selecting from p> (where x > 3), an
> ArrayIndexOutOfBoundsException will be thrown because Hive assumes there are
> only three buckets.
> Second, other applications (such as Pig) may be making assumptions about the
> number of files equaling the number of buckets.
> Possible fixes:
> * Revert HIVE-13040
> * Change how tablesampling is implemented to accept possibility that number
> of files != number of buckets
> ** Would require coordination across projects to change assumptions
> Things to consider:
> * what performance gains are there from not creating empty files?
> * if the gains are large, are we willing to lose them? (by reverting
> HIVE-13040)
> * _how else can we avoid creating unnecessary files, while still maintaining
> invariants other applications expect?_

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
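
The failure mode described in the issue, and the second proposed fix, can be illustrated with a minimal Java sketch. This is not Hive's actual sampling code: the class and method names (BucketSampleSketch, naiveLookup, tolerantLookup) are hypothetical, and the example file names assume that a bucket file's name starts with its zero-based bucket id followed by an underscore (e.g. 000004_0). The sketch contrasts indexing the file list by bucket number with looking a file up by the id in its name, so that a missing file is treated as an empty bucket.

```java
import java.util.Optional;

final class BucketSampleSketch {

    // Naive lookup: assumes one file per bucket, so files[i] holds bucket i + 1.
    // Fails when fewer files exist than buckets.
    static String naiveLookup(String[] files, int bucket) {
        return files[bucket - 1];
    }

    // Tolerant lookup: matches on the bucket id parsed from the file name
    // (assumed format "<zero-based id>_<suffix>"). A missing file means the
    // bucket is empty, which is reported as Optional.empty().
    static Optional<String> tolerantLookup(String[] files, int bucket) {
        for (String file : files) {
            int id = Integer.parseInt(file.substring(0, file.indexOf('_')));
            if (id == bucket - 1) {
                return Optional.of(file);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three rows landed in three of ten buckets; no files exist for the rest.
        String[] files = {"000001_0", "000004_0", "000007_0"};

        try {
            naiveLookup(files, 5); // index 4 in an array of length 3
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive lookup failed: " + e.getMessage());
        }

        System.out.println(tolerantLookup(files, 5)); // file with id 4 exists
        System.out.println(tolerantLookup(files, 4)); // no file with id 3: empty bucket
    }
}
```

With three files and bucket 5 requested, naiveLookup throws, while tolerantLookup finds 000004_0. For bucket 4 it returns an empty Optional, which a caller could treat as a sample of zero rows. Other tools that assume one file per bucket would still need a matching change.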