[ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Summary: LLAP: use LLAP cache for non-columnar formats in a somewhat 
general way  (was: LLAP: support cache for non-columnar formats in a somewhat 
general way)

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> The primary target for the first pass is caching text formats. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded 
> cache that is columnar due to the ORC file structure, we will try to reuse 
> that. The general idea is to treat all the data in the world as merely ORC 
> that was compressed with some poor compression codec, such as CSV. Using the 
> original InputFormat (IF) and SerDe, as well as the ORC writer (potentially 
> with some heavyweight optimizations removed), we can "uncompress" the data 
> into "original" ORC, then reuse a lot of the existing code.
> Various other points:
> 1) Granularity in the file will have to be somehow determined (horizontal 
> slicing of the file, to avoid caching entire columns). We can base it on 
> arbitrary disk offsets determined during reading, but they will actually have 
> to be propagated to the reader from the original inputformat. Row counts are 
> easier to use but there's a problem of how to actually map them to missing 
> ranges to read from disk.
> 2) Obviously for row-based formats, if any one of the needed columns is 
> evicted, "all the columns" have to be read for the corresponding slice. The 
> vague plan is to handle this implicitly, similarly to how the ORC reader 
> handles CB-RG overlaps: it will just so happen that a missing column will 
> expand the disk-range-to-read into the whole horizontal slice of the file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would also be possible to build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for this stage.
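
Points 1) to 3) above can be illustrated with a small sketch. This is not the LLAP implementation and not Hive code. It is a toy model in Python, assuming the file has already been cut into horizontal slices by byte offset and that the cache tracks (slice, column) pairs. The names Slice, ranges_to_read and splittable are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Slice:
    """Hypothetical horizontal slice of a row-based file: bytes [start, end)."""
    start: int
    end: int


def ranges_to_read(
    slices: Sequence[Slice],
    cached: Set[Tuple[int, int]],
    needed_columns: Iterable[int],
    splittable: bool = True,
) -> List[Tuple[int, int]]:
    """Return the merged byte ranges that must be re-read from disk.

    `cached` holds (slice_index, column_index) pairs still present in the
    cache. Assumption: slices are sorted by offset and do not overlap.
    """
    needed = list(needed_columns)
    # A slice is a miss if at least one needed column was evicted. Since the
    # format is row-based, a single missing column costs the whole slice.
    missing = [
        i for i in range(len(slices))
        if any((i, c) not in cached for c in needed)
    ]
    if not missing:
        return []
    if not splittable:
        # E.g. gzipped text: any miss means the entire file is re-read.
        return [(slices[0].start, slices[-1].end)]
    ranges: List[Tuple[int, int]] = []
    for i in missing:
        s = slices[i]
        if ranges and ranges[-1][1] == s.start:
            # Adjacent to the previous missing slice: extend that range.
            ranges[-1] = (ranges[-1][0], s.end)
        else:
            ranges.append((s.start, s.end))
    return ranges
```

For example, with three slices covering bytes 0 to 300 and only column 1 of the middle slice evicted, a query needing columns 0 and 1 re-reads just the middle slice's byte range, while a query needing only column 0 reads nothing from disk. With splittable=False the same eviction forces a re-read of the full 0 to 300 range.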



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
