[ https://issues.apache.org/jira/browse/HIVE-22938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046751#comment-17046751 ]
Marton Bod commented on HIVE-22938:
-----------------------------------

[~ashutoshc], [~gopalv] - I'm investigating whether we can stop creating empty bucket files when using MR/Spark (seems like we're already not creating them with Tez). So far I have not seen a scenario which makes use of these empty files: in my local tests, I have manually deleted some of these empty files from the delta directories and did not see any anomalies afterwards when reading the data back, or running compaction.

But I might be missing some other area - do you have any ideas where the empty bucket files might become important?

> Investigate possibility of removing empty bucket file creation mechanism in Hive-on-MR
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-22938
>                 URL: https://issues.apache.org/jira/browse/HIVE-22938
>             Project: Hive
>          Issue Type: Task
>            Reporter: Marton Bod
>            Priority: Major
>
> As a follow-up to HIVE-22918, this ticket is to investigate whether the empty
> bucket file creation mechanism can be removed safely when using MR as the
> engine.
> For a bucketed table of N buckets, each insert will generate N bucket files
> in the delta directory, regardless of how many actual buckets are written to.
> As an example, if a table has 500 buckets, and we insert a single record, 499
> empty bucket files are generated alongside the single bucket that contains
> the actual data. This makes the operation substantially slower in some cases.
> This behaviour only seems to happen when using MR as the execution engine.
> Some components/parts of the code might depend on this behaviour though, so
> it needs to be verified that removing this logic does not interfere with
> anything.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)