Prasanth Jayachandran created HIVE-13985:
--------------------------------------------

             Summary: ORC improvements for reducing the file system calls in task side
                 Key: HIVE-13985
                 URL: https://issues.apache.org/jira/browse/HIVE-13985
             Project: Hive
          Issue Type: Bug
          Components: ORC
    Affects Versions: 1.3.0, 2.2.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran


HIVE-13840 fixed some issues with additional file system invocations during split 
generation. Similarly, this jira will fix issues with additional file system 
invocations on the task side. To avoid reading footers on the task side, users 
can set hive.orc.splits.include.file.footer to true, which serializes the ORC 
footers into the splits. However, this also serializes unneeded information, 
such as column statistics and other metadata, that is not required for reading 
an ORC split on the task side. We can reduce the payload of the ORC splits by 
serializing only the minimum required information (stripe information, types, 
compression details). This can potentially avoid OOMs in the application master 
(AM) during split generation. This jira also addresses other issues concerning 
the AM cache. 
The local cache used by the AM is a soft reference cache, which can introduce 
unpredictability across multiple runs of the same query. We can cache the 
serialized footer in the local cache and use a strong reference cache instead, 
which should avoid memory pressure and give better predictability.
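
A rough sketch of the two ideas above, not the actual patch: a trimmed footer 
that carries only stripe information, types and the compression kind, 
serialized once, and a bounded cache that holds those bytes with strong 
references. MinimalFooter, SerializedFooterCache and the byte layout are 
hypothetical names and choices made for illustration; the real ORC footer is a 
protobuf message with more fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical trimmed footer: stripe layout, types, compression; no column statistics. */
final class MinimalFooter {
    final long[] stripeOffsets;
    final long[] stripeLengths;   // same length as stripeOffsets
    final String[] types;
    final String compression;

    MinimalFooter(long[] stripeOffsets, long[] stripeLengths, String[] types, String compression) {
        this.stripeOffsets = stripeOffsets;
        this.stripeLengths = stripeLengths;
        this.types = types;
        this.compression = compression;
    }

    /** Encodes the footer once so the same bytes can be cached and attached to splits. */
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(stripeOffsets.length);
            for (int i = 0; i < stripeOffsets.length; i++) {
                out.writeLong(stripeOffsets[i]);
                out.writeLong(stripeLengths[i]);
            }
            out.writeInt(types.length);
            for (String type : types) {
                out.writeUTF(type);
            }
            out.writeUTF(compression);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the footer on the task side from the bytes carried in the split. */
    static MinimalFooter deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int stripeCount = in.readInt();
            long[] offsets = new long[stripeCount];
            long[] lengths = new long[stripeCount];
            for (int i = 0; i < stripeCount; i++) {
                offsets[i] = in.readLong();
                lengths[i] = in.readLong();
            }
            String[] types = new String[in.readInt()];
            for (int i = 0; i < types.length; i++) {
                types[i] = in.readUTF();
            }
            return new MinimalFooter(offsets, lengths, types, in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Bounded LRU cache holding strong references to serialized footers, keyed by file path. */
final class SerializedFooterCache extends LinkedHashMap<String, byte[]> {
    private final int maxEntries;

    SerializedFooterCache(int maxEntries) {
        super(16, 0.75f, true); // access order, so the eldest entry is the least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries;
    }
}
```

With a fixed entry limit, eviction depends only on the access pattern and not 
on when the garbage collector clears soft references, which is where the 
better predictability would come from. The limit itself is an assumption here 
and would need to be sized against the AM heap.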

One other improvement: when hive.orc.splits.include.file.footer is set to 
false, the task side makes one additional file system call to get the size of 
the file. If we serialize the file length in the ORC split, this call can be 
avoided.
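
As an illustration of the file length idea, assuming a hypothetical split class 
(SplitWithLength is not the real OrcSplit, and the lookup function stands in 
for a file system status call): the task falls back to the file system only 
when the split does not carry the length.

```java
import java.util.function.ToLongFunction;

/** Hypothetical split descriptor; the real OrcSplit has more fields. */
final class SplitWithLength {
    /** Sentinel for splits that were created without a known file length. */
    static final long UNKNOWN_LENGTH = -1L;

    final String path;
    final long start;
    final long length;
    final long fileLength;

    SplitWithLength(String path, long start, long length, long fileLength) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.fileLength = fileLength;
    }

    /** Returns the file length, invoking the lookup only if the split did not carry it. */
    long resolveFileLength(ToLongFunction<String> fileSystemLookup) {
        return fileLength != UNKNOWN_LENGTH ? fileLength : fileSystemLookup.applyAsLong(path);
    }
}
```

Keeping a sentinel for the unknown case lets splits produced without the 
length keep working, at the cost of the extra call they make today.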



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
