[jira] [Comment Edited] (HIVE-22977) Merge delta files instead of running a query in major/minor compaction

Gopal Vijayaraghavan (Jira) Thu, 05 Mar 2020 10:16:21 -0800


    [ https://issues.apache.org/jira/browse/HIVE-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052398#comment-17052398 ]


Gopal Vijayaraghavan edited comment on HIVE-22977 at 3/5/20, 6:15 PM:
----------------------------------------------------------------------

This is most likely not an optimization & might make read queries worse.

{code}
    HIVE_ORC_BASE_DELTA_RATIO("hive.exec.orc.base.delta.ratio", 8, "The ratio of base writer and\n" +
        "delta writer in terms of STRIPE_SIZE and BUFFER_SIZE."),
    HIVE_ORC_DELTA_STREAMING_OPTIMIZATIONS_ENABLED("hive.exec.orc.delta.streaming.optimizations.enabled", false,
      "Whether to enable streaming optimizations for ORC delta files. This will disable ORC's internal indexes,\n" +
        "disable compression, enable fast encoding and disable dictionary encoding."),
{code}

https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2043

The stripe size for the deltas is 8x smaller than for the regular base files, 
on the assumption that a compactor will fix it after the inserts are done. 
Merging the deltas as-is would make the bad striping permanent.
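
To make the striping point concrete, here is a rough arithmetic sketch. The 
ratio of 8 comes from the HiveConf entry quoted above; the 64 MB base stripe 
size and the 1 GB data volume are assumptions chosen only for illustration, 
and the class and method names are hypothetical, not Hive or ORC APIs.

```java
public class DeltaStripeSizing {
    // Default of hive.exec.orc.base.delta.ratio, as quoted from HiveConf above.
    static final int BASE_DELTA_RATIO = 8;

    // Stripe size a delta writer would use for a given base stripe size.
    static long deltaStripeSize(long baseStripeSize, int ratio) {
        return baseStripeSize / ratio;
    }

    // Number of stripes needed to hold dataBytes (rounded up).
    static long stripeCount(long dataBytes, long stripeSize) {
        return (dataBytes + stripeSize - 1) / stripeSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long baseStripe = 64 * mb;   // assumed base stripe size, illustration only
        long data = 1024 * mb;       // assumed 1 GB of row data, illustration only

        long deltaStripe = deltaStripeSize(baseStripe, BASE_DELTA_RATIO);

        System.out.println("delta stripe size (MB): " + deltaStripe / mb);
        System.out.println("stripes at base size:   " + stripeCount(data, baseStripe));
        System.out.println("stripes at delta size:  " + stripeCount(data, deltaStripe));
    }
}
```

Under these assumptions the same data ends up in 8x as many stripes if the 
delta stripes are concatenated instead of rewritten at the base stripe size.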

The streaming inserts do not write any ORC indexes for the same reason: it 
makes streaming faster, on the assumption that a compactor will rebuild the 
min/max and bloom filter indexes when it runs asynchronously in the 
background. Merging stripes without rebuilding indexes will leave the 
compacted data with no ability to do predicate push-down.
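
A simplified model of why the missing indexes matter is sketched below. It is 
not ORC's reader API: the Stripe class and the mustRead and stripesRead 
helpers are hypothetical names, and only min/max statistics are modelled 
(no bloom filters). The idea is that a reader can skip a stripe only when it 
has statistics proving the predicate cannot match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

public class StripeSkipSketch {
    // Hypothetical stand-in for a stripe with optional min/max column statistics.
    static final class Stripe {
        final OptionalLong min;
        final OptionalLong max;

        private Stripe(OptionalLong min, OptionalLong max) {
            this.min = min;
            this.max = max;
        }

        static Stripe indexed(long min, long max) {
            return new Stripe(OptionalLong.of(min), OptionalLong.of(max));
        }

        // Models a stripe written without indexes, e.g. merged as-is from a delta.
        static Stripe unindexed() {
            return new Stripe(OptionalLong.empty(), OptionalLong.empty());
        }
    }

    // For a predicate "col = value": without statistics the stripe must be read.
    static boolean mustRead(Stripe s, long value) {
        if (!s.min.isPresent() || !s.max.isPresent()) {
            return true;
        }
        return value >= s.min.getAsLong() && value <= s.max.getAsLong();
    }

    static long stripesRead(List<Stripe> stripes, long value) {
        return stripes.stream().filter(s -> mustRead(s, value)).count();
    }

    public static void main(String[] args) {
        List<Stripe> rebuilt = Arrays.asList(
            Stripe.indexed(0, 99), Stripe.indexed(100, 199), Stripe.indexed(200, 299));
        List<Stripe> mergedAsIs = Arrays.asList(
            Stripe.unindexed(), Stripe.unindexed(), Stripe.unindexed());

        System.out.println("with indexes:    " + stripesRead(rebuilt, 150));
        System.out.println("without indexes: " + stripesRead(mergedAsIs, 150));
    }
}
```

In this toy model the indexed layout reads one stripe for the lookup while the 
unindexed layout reads all three, which is the push-down loss described above.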

The 10% of data in deltas can perform below par for read throughput, but 
making these two problems (small stripes and missing indexes) permanent by 
running MergeTask instead is probably going to make the compactor faster and 
everything else slower.



> Merge delta files instead of running a query in major/minor compaction
> ----------------------------------------------------------------------
>
>                 Key: HIVE-22977
>                 URL: https://issues.apache.org/jira/browse/HIVE-22977
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>         Attachments: HIVE-22977.01.patch, HIVE-22977.02.patch
>
>
> [Compaction Optimization]
> We should analyse the possibility to move a delta file instead of running a 
> major/minor compaction query.
> Please consider the following use cases:
>  - full acid table but only insert queries were run. This means that no 
> delete delta directories were created. Is it possible to merge the delta 
> directory contents without running a compaction query?
>  - full acid table, initiating queries through the streaming API. If there 
> are no aborted transactions during the streaming, is it possible to merge 
> the delta directory contents without running a compaction query?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
