Sergey Shelukhin created HIVE-15147:
---------------------------------------

             Summary: LLAP: support cache for non-columnar formats in a somewhat general way
                 Key: HIVE-15147
                 URL: https://issues.apache.org/jira/browse/HIVE-15147
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin
            Assignee: Sergey Shelukhin


The primary target for the first pass is caching text formats. Nothing would 
prevent other formats from using the same path, in principle, although, as was 
originally done with ORC, it may be better to have native caching support 
optimized for each particular format.
Given that caching pure text is not smart, and we already have an ORC-encoded 
cache that is columnar due to the ORC file structure, we will try to reuse that. 
The general idea is to treat all the data in the world as merely ORC that was 
compressed with some poor compression codec, such as csv. Using the original 
InputFormat (IF) and serde, as well as the ORC writer (with some heavyweight 
optimizations removed, potentially), we can "uncompress" the data into 
"original" ORC, then reuse a lot of the existing code.
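
To make the "uncompress into columnar" idea concrete, here is a rough, 
self-contained sketch. It is not Hive or ORC code: TextToColumnarSketch, 
ColumnarSlice, deserialize and toSlices are all hypothetical names, the 
"serde" is a naive comma split with no quoting support, and plain string 
lists stand in for ORC-encoded column streams. It only illustrates reading 
rows through the original row-based path and regrouping them into per-column 
data, cut into horizontal slices that remember their byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are Hive or ORC APIs.
final class TextToColumnarSketch {

    /** One horizontal slice of the file, stored column by column. */
    static final class ColumnarSlice {
        final long startOffset;            // byte offset of the first row in the slice
        final long endOffset;              // byte offset just past the last row
        final List<List<String>> columns;  // columns.get(c) holds the values of column c

        ColumnarSlice(long startOffset, long endOffset, List<List<String>> columns) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.columns = columns;
        }
    }

    /** Stand-in for the original serde: naive CSV split, no quoting or escaping. */
    static String[] deserialize(String line) {
        return line.split(",", -1);
    }

    private static List<List<String>> newColumns(int columnCount) {
        List<List<String>> columns = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            columns.add(new ArrayList<>());
        }
        return columns;
    }

    /**
     * Regroups row-oriented text into columnar slices of at most rowsPerSlice rows.
     * Assumes non-empty input, single-byte characters, and '\n' after every row,
     * so that string lengths can stand in for disk offsets.
     */
    static List<ColumnarSlice> toSlices(String text, int columnCount, int rowsPerSlice) {
        List<ColumnarSlice> slices = new ArrayList<>();
        List<List<String>> columns = newColumns(columnCount);
        long offset = 0;
        long sliceStart = 0;
        int rows = 0;
        for (String line : text.split("\n")) {
            String[] fields = deserialize(line);
            for (int c = 0; c < columnCount; c++) {
                // Short rows are padded with nulls.
                columns.get(c).add(c < fields.length ? fields[c] : null);
            }
            offset += line.length() + 1; // +1 for the newline
            rows++;
            if (rows == rowsPerSlice) {
                slices.add(new ColumnarSlice(sliceStart, offset, columns));
                columns = newColumns(columnCount);
                sliceStart = offset;
                rows = 0;
            }
        }
        if (rows > 0) {
            slices.add(new ColumnarSlice(sliceStart, offset, columns));
        }
        return slices;
    }

    public static void main(String[] args) {
        for (ColumnarSlice s : toSlices("a,1\nb,2\nc,3\n", 2, 2)) {
            System.out.println("[" + s.startOffset + ", " + s.endOffset + ") " + s.columns);
        }
    }
}
```

In the actual proposal the per-column data would presumably go through the 
ORC writer rather than sit in string lists; the sketch only shows the 
regrouping step and why slice offsets need to be tracked while reading.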
Various other points:
1) Granularity in the file will have to be somehow determined (horizontal 
slicing of the file, to avoid caching entire columns). We can base it on 
arbitrary disk offsets determined during reading, but they will actually have 
to be propagated to the reader from the original inputformat. Row counts are 
easier to use but there's a problem of how to actually map them to missing 
ranges to read from disk.
2) Obviously for row-based formats, if any one of the needed columns is 
evicted, "all the columns" have to be read for the corresponding slice. The 
vague plan is to handle this implicitly, similarly to how ORC reader handles 
CB-RG overlaps - it will just so happen that a missing column will expand the 
disk-range-to-read into the whole horizontal slice of the file.
3) Granularity/etc. won't work for gzipped text, since a gzip stream cannot be 
read starting from an arbitrary offset. If anything at all is evicted, the 
entire file has to be re-read. Gzipped text is a ridiculous feature, so this is 
by design.
4) In the future, it would be possible to also build some form of 
metadata/indexes for this cached data to do PPD, etc. This is out of the scope 
of this stage.
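
Points 2 and 3 can be illustrated with a small sketch of the read planning. 
Again, this is not the LLAP implementation: SliceReadPlannerSketch, Slice and 
rangesToRead are hypothetical names, and the sketch assumes slices are sorted 
by offset and that the cache tracks, per slice, which columns are still 
present. A slice missing any needed column contributes its whole byte range, 
adjacent ranges are merged, and a non-splittable file (e.g. gzipped text) 
degrades to re-reading everything.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; none of these names are Hive or LLAP APIs.
final class SliceReadPlannerSketch {

    /** A horizontal slice of the file plus the set of columns still cached for it. */
    static final class Slice {
        final long start;            // inclusive byte offset
        final long end;              // exclusive byte offset
        final BitSet cachedColumns;

        Slice(long start, long end, int... cachedColumnIndexes) {
            this.start = start;
            this.end = end;
            this.cachedColumns = new BitSet();
            for (int c : cachedColumnIndexes) {
                cachedColumns.set(c);
            }
        }

        boolean hasAll(int[] neededColumns) {
            for (int c : neededColumns) {
                if (!cachedColumns.get(c)) {
                    return false;
                }
            }
            return true;
        }
    }

    /**
     * Returns the [start, end) disk ranges that must be re-read.
     * Assumes slices are sorted by offset and do not overlap.
     */
    static List<long[]> rangesToRead(List<Slice> slices, int[] neededColumns,
                                     boolean splittable, long fileLength) {
        List<long[]> ranges = new ArrayList<>();
        for (Slice s : slices) {
            if (s.hasAll(neededColumns)) {
                continue; // fully served from cache
            }
            if (!splittable) {
                // Anything missing in a non-splittable file means the whole file is re-read.
                List<long[]> whole = new ArrayList<>();
                whole.add(new long[] {0, fileLength});
                return whole;
            }
            // One missing column expands the read to the entire slice (row-based format).
            long[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && last[1] == s.start) {
                last[1] = s.end; // merge with the adjacent previous range
            } else {
                ranges.add(new long[] {s.start, s.end});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Slice> slices = new ArrayList<>();
        slices.add(new Slice(0, 100, 0, 1, 2));
        slices.add(new Slice(100, 200, 0, 2));
        slices.add(new Slice(200, 300, 1));
        for (long[] r : rangesToRead(slices, new int[] {0, 1}, true, 300)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Whether the slice boundaries come from disk offsets or row counts (point 1) 
does not change this logic, as long as each slice can be mapped back to a 
byte range in the original file.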



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
