[jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side

Prasanth Jayachandran (JIRA) Thu, 16 Jun 2016 11:49:21 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334425#comment-15334425
 ]


Prasanth Jayachandran commented on HIVE-13985:
----------------------------------------------

:) That file will go away after HIVE-14007.

Yeah, hard references may cause OOM, but at least the behavior will be 
consistent across runs. If the AM hits an OOM, we can easily guess that the 
local cache is the cause. With soft references, split computation time may 
vary dramatically under memory pressure, leading to unpredictable overall 
query performance. That was the rationale behind this change. Only 
initialCapacity is removed; the maximum capacity is still retained (for 
boundedness). Setting a higher initialCapacity wastes memory unnecessarily. 
It's a tradeoff between wasted memory and rehashing cost.
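
To make the tradeoff concrete, here is a rough sketch of a strong-reference 
cache that is bounded by a maximum entry count but does not pre-size its table. 
This is not the code in the patch: the class name BoundedStrongCache and its 
methods are made up for illustration, and it uses only java.util rather than 
whatever cache implementation Hive actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not Hive's local cache implementation.
class BoundedStrongCache<K, V> {
    private final LinkedHashMap<K, V> map;

    BoundedStrongCache(final int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        // Default initial capacity (16): the table grows on demand instead of
        // being pre-sized to maxEntries, trading some rehashing for less
        // up-front memory. accessOrder=true gives LRU iteration order.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Enforce the maximum capacity (boundedness).
                return size() > maxEntries;
            }
        };
    }

    // Values are held by strong references, so the GC never drops an entry;
    // only the size bound above evicts.
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

With a bound of 2, putting "a" and "b", reading "a", then putting "c" evicts 
"b" (the least recently used entry). Eviction depends only on the access 
pattern, not on heap pressure, which is the predictability argument above.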

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, 
> HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during 
> split generation. Similarly, this jira will fix issues with additional file 
> system invocations on the task side. To avoid reading footers on the task 
> side, users can set hive.orc.splits.include.file.footer to true which will 
> serialize the orc footers on the splits. But this has issues with serializing 
> unwanted information like column statistics and other metadata which are not 
> really required for reading orc split on the task side. We can reduce the 
> payload on the orc splits by serializing only the minimum required 
> information (stripe information, types, compression details). This will 
> decrease the payload on the orc splits and can potentially avoid OOMs in 
> application master (AM) during split generation. This jira also addresses 
> other issues concerning the AM cache. The local cache used by the AM is a 
> soft reference cache. This can introduce unpredictability across multiple 
> runs of the same query. We can cache the serialized footer in the local cache 
> and also use a strong reference cache, which should reduce memory pressure 
> and give better predictability.
> One other improvement that we can do is when 
> hive.orc.splits.include.file.footer is set to false, on the task side we make 
> one additional file system call to know the size of the file. If we can 
> serialize the file length in the orc split this can be avoided.
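
The file-length idea in the description can be sketched roughly as below. This 
is only an illustration under assumptions: SplitSketch and its fields are 
hypothetical names, not Hive's ORC split class, and it uses plain java.io 
streams instead of Hadoop's serialization interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical split that carries the file length, so the task side does not
// need an extra file system call just to learn the size of the file.
class SplitSketch {
    final String path;
    final long start;       // byte offset where this split begins
    final long length;      // number of bytes covered by this split
    final long fileLength;  // total file size, captured during split generation

    SplitSketch(String path, long start, long length, long fileLength) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.fileLength = fileLength;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(start);
        out.writeLong(length);
        out.writeLong(fileLength); // the one extra field on the wire
    }

    static SplitSketch read(DataInput in) throws IOException {
        String path = in.readUTF();
        long start = in.readLong();
        long length = in.readLong();
        long fileLength = in.readLong();
        return new SplitSketch(path, start, length, fileLength);
    }

    // Serializes and deserializes in memory, standing in for the AM-to-task hop.
    static SplitSketch roundTrip(SplitSketch split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            return read(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is one extra long per split; in this sketch the task reads fileLength 
from the deserialized split instead of asking the file system for it.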



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
