[ https://issues.apache.org/jira/browse/HIVE-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120678#comment-16120678 ]
Eugene Koifman commented on HIVE-17138:
---------------------------------------

OrcRecordUpdater also has some inconsistent logic as to when it creates an empty file: for "legacy", always; for "default", never. It should get a switch, just like FileSinkOperator, that checks engine type (and some other prop).

> FileSinkOperator doesn't create empty files for acid path
> ---------------------------------------------------------
>
>                 Key: HIVE-17138
>                 URL: https://issues.apache.org/jira/browse/HIVE-17138
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> For bucketed tables, FileSinkOperator is expected (in some cases) to produce
> a specific number of files even if they are empty.
> FileSinkOperator.closeOp(boolean abort) has logic to create files even if
> they are empty.
> This doesn't work properly for the Acid path. For Insert, the
> OrcRecordUpdater(s) are set up in createBucketForFileIdx(), which creates the
> actual bucketN file (as of HIVE-14007, it does so regardless of whether the
> RecordUpdater sees any rows). This causes empty (i.e. ORC metadata only)
> bucket files to be created for multiFileSpray=true if a particular
> FileSinkOperator.process() sees at least 1 row. For example,
> {noformat}
> create table fourbuckets (a int, b int) clustered by (a) into 4 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
> insert into fourbuckets values(0,1),(1,1);
> with mapreduce.job.reduces = 1 or 2
> {noformat}
> For the Update/Delete path, OrcRecordWriter is created lazily when the 1st row
> that needs to land there is seen. Thus it never creates empty buckets, no
> matter what the value of _skipFiles_ in closeOp(boolean) is.
> Once Split Update does the split early (in the operator pipeline), only the
> Insert path will matter, since base and delta are the only files that split
> computation, etc. looks at.
> delete_delta is only for Acid internals, so there is never any
> reason to create empty files there.
> Also make sure to close RecordUpdaters in FileSinkOperator.abortWriters()
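
The difference between the Insert and Update/Delete paths described above can be illustrated with a toy model. This is only a sketch and not Hive code: the class BucketFileSketch, the method filesCreated and the eager flag are hypothetical names, and assigning a row to bucket "key mod numBuckets" is a simplifying assumption. It models a single writer task that either opens every bucket writer as soon as it sees its first row (as the Insert path is described) or opens a writer only when a row lands in that bucket (as the Update/Delete path is described).

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model, NOT Hive code: all names here are hypothetical.
public class BucketFileSketch {

    // Returns the bucket ids for which a file would exist after one writer task finishes.
    // eager = true  : every bucket writer is opened when the first row arrives,
    //                 so a file exists for every bucket even if it receives no rows.
    // eager = false : a bucket writer is opened only when a row lands in that bucket.
    static Set<Integer> filesCreated(int[] keys, int numBuckets, boolean eager) {
        Set<Integer> files = new TreeSet<>();
        for (int key : keys) {
            if (eager && files.isEmpty()) {
                // First row seen: open all writers up front.
                for (int b = 0; b < numBuckets; b++) {
                    files.add(b);
                }
            }
            // Assumption: the bucket is simply key mod numBuckets.
            files.add(Math.floorMod(key, numBuckets));
        }
        return files;
    }

    public static void main(String[] args) {
        int[] keys = {0, 1}; // the "a" values from the fourbuckets insert above
        System.out.println("eager: " + filesCreated(keys, 4, true));  // eager: [0, 1, 2, 3]
        System.out.println("lazy:  " + filesCreated(keys, 4, false)); // lazy:  [0, 1]
    }
}
```

In this model neither mode produces any file when the task sees no rows at all, which corresponds to the case where logic at close time would have to create the missing empty files.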