[jira] [Work logged] (HIVE-25843) Add flag to disable Iceberg FileIO config serialization

ASF GitHub Bot (Jira) Wed, 05 Jan 2022 06:39:13 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-25843?focusedWorklogId=704019&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-704019
 ]


ASF GitHub Bot logged work on HIVE-25843:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Jan/22 14:38
            Start Date: 05/Jan/22 14:38
    Worklog Time Spent: 10m 
      Work Description: marton-bod removed a comment on pull request #2917:
URL: https://github.com/apache/hive/pull/2917#issuecomment-1005740910


   @pvary Can you please take an initial look? I'm still thinking about the 
best way to do this, but currently I think using a validation method on the 
storage handler is the best way to go. Not entirely comfortable with tying this 
new method to the FileSinkDesc (ideally I'd like to make it a bit more generic) 
but so far that was the only thing that worked out well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 704019)
    Time Spent: 0.5h  (was: 20m)

> Add flag to disable Iceberg FileIO config serialization
> -------------------------------------------------------
>
>                 Key: HIVE-25843
>                 URL: https://issues.apache.org/jira/browse/HIVE-25843
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Marton Bod
>            Assignee: Marton Bod
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hive serializes the Iceberg table object into each individual split. Since 
> the FileIO is part of the Iceberg table and carries its own Hadoop 
> configuration, this configuration becomes the dominant factor in the 
> size of the serialized split. In our tests we found that, due to this 
> serialized config, Iceberg splits are 15-20x larger than normal Hive splits 
> (which led to OOM in some of our perf tests).
> This PR proposes to introduce a config flag that turns off this config 
> serialization and lets the deserializer side fill in the config values 
> instead (which works for Hive executors, since they have all the config 
> values in hand). This can reduce the Iceberg split size by ~20x based on 
> local tests.
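
The idea described above can be illustrated with a small, self-contained sketch. It is not the Hive or Iceberg implementation: the class SketchSplit, its fields, the serializeConfig flag, and the sample property names are all hypothetical, and plain Java serialization stands in for whatever mechanism Hive actually uses. It only shows why a per-split config dominates the serialized size, and how the reading side can fall back to a config it already holds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a split that carries a FileIO-style config.
// None of these names are Hive or Iceberg APIs.
class SketchSplit implements Serializable {
    private static final long serialVersionUID = 1L;

    final String path;
    // Null when the writer side has config serialization turned off.
    final HashMap<String, String> config;

    SketchSplit(String path, Map<String, String> config, boolean serializeConfig) {
        this.path = path;
        this.config = serializeConfig ? new HashMap<>(config) : null;
    }

    // Reader side: use the shipped config if present, otherwise the local one.
    Map<String, String> effectiveConfig(Map<String, String> localConfig) {
        return config != null ? config : localConfig;
    }

    // Serialize a split to bytes with standard Java serialization.
    static byte[] toBytes(SketchSplit split) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(split);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild a split from bytes produced by toBytes.
    static SketchSplit fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SketchSplit) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Build a made-up config with the given number of entries.
    static Map<String, String> sampleConfig(int entries) {
        Map<String, String> conf = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            conf.put("sketch.property." + i, "value-" + i);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = sampleConfig(1000);
        byte[] withConf = toBytes(new SketchSplit("/data/file.parquet", conf, true));
        byte[] withoutConf = toBytes(new SketchSplit("/data/file.parquet", conf, false));
        System.out.println("with config:    " + withConf.length + " bytes");
        System.out.println("without config: " + withoutConf.length + " bytes");

        // The reader already has the config locally, so nothing is lost.
        SketchSplit restored = fromBytes(withoutConf);
        System.out.println("entries on reader side: " + restored.effectiveConfig(conf).size());
    }
}
```

The size ratio printed by this sketch depends entirely on the made-up config and should not be read as a reproduction of the 15-20x figure reported in the issue.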



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
