[ https://issues.apache.org/jira/browse/HIVE-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052398#comment-17052398 ]
Gopal Vijayaraghavan edited comment on HIVE-22977 at 3/5/20, 6:15 PM:
----------------------------------------------------------------------

This is most likely not an optimization & might make read queries worse.

{code}
    HIVE_ORC_BASE_DELTA_RATIO("hive.exec.orc.base.delta.ratio", 8, "The ratio of base writer and\n" +
        "delta writer in terms of STRIPE_SIZE and BUFFER_SIZE."),
    HIVE_ORC_DELTA_STREAMING_OPTIMIZATIONS_ENABLED("hive.exec.orc.delta.streaming.optimizations.enabled", false,
        "Whether to enable streaming optimizations for ORC delta files. This will disable ORC's internal indexes,\n" +
        "disable compression, enable fast encoding and disable dictionary encoding."),
{code}

https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2043

The stripe sizing for the deltas is 8x smaller than for the regular base files, on the assumption that a compactor will go fix it after inserts are done. Merging them would make the bad striping permanent.

For the same reason, the streaming inserts do not write any ORC indexes: this makes streaming faster, on the assumption that a compactor will rebuild the min/max/bloom indexes when it runs asynchronously in the background. Merging stripes without rebuilding indexes will leave the compacted data unable to support predicate push-down.

The 10% of data in deltas can perform below par for read throughput, but making these two problems permanent by running MergeTask instead is probably going to make the compactor faster and everything else slower.


was (Author: gopalv):
This is most likely not an optimization & might make read queries worse.

The stripe sizing for the deltas is 8x smaller than for the regular base files, on the assumption that a compactor will go fix it after inserts are done. Merging them would make the bad striping permanent.

For the same reason, the streaming inserts do not write any ORC indexes: this makes streaming faster, on the assumption that a compactor will rebuild the min/max/bloom indexes when it runs asynchronously in the background. Merging stripes without rebuilding indexes will leave the compacted data unable to support predicate push-down.

The 10% of data in deltas can perform below par for read throughput, but making these two problems permanent by running MergeTask instead is probably going to make the compactor faster and everything else slower.

> Merge delta files instead of running a query in major/minor compaction
> ----------------------------------------------------------------------
>
>                 Key: HIVE-22977
>                 URL: https://issues.apache.org/jira/browse/HIVE-22977
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>         Attachments: HIVE-22977.01.patch, HIVE-22977.02.patch
>
>
> [Compaction Optimization]
> We should analyse the possibility of moving a delta file instead of running a
> major/minor compaction query.
> Please consider the following use cases:
>  - full acid table but only insert queries were run. This means that no
> delete delta directories were created. Is it possible to merge the delta
> directory contents without running a compaction query?
>  - full acid table, initiating queries through the streaming API. If there
> are no aborted transactions during the streaming, is it possible to merge the
> delta directory contents without running a compaction query?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)