[ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803284#comment-15803284 ]

Sergey Shelukhin commented on HIVE-15147:
-----------------------------------------

https://reviews.apache.org/r/55247/
I see some log lines at INFO that need to be changed to DEBUG/TRACE.

Theoretically, this can work with any row-based format, but without offset support 
the horizontal caching slice would be the entire file (the cached data is still 
separated by columns).
For LineRecordReader without compression, file offset support is implemented, 
slicing the files on row boundaries. It should be easy to add for other readers 
with self-contained rows (i.e. where each row has a start and an end offset).
The slicing currently relies on LineRecordReader's assumptions about how Hive 
splits handle torn rows (rows that straddle a split boundary); those look 
reasonable - presumably other row readers make the same assumptions for torn 
rows, but we'd need to verify that when/if we add support for them. A sketch of 
the convention follows below.
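
For reference, the torn-row convention amounts to this (a minimal, illustrative 
sketch over a plain RandomAccessFile, not the actual LineRecordReader code): a 
split skips the possibly-torn row at its start unless it starts at byte 0, and 
reads past its end to finish the row straddling the boundary. Any offset-based 
slicing has to follow the same rule so that every row lands in exactly one slice.
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

public class TornRowSketch {
  // Reads exactly the rows that belong to the split [start, end): the
  // torn row at 'start' is skipped (the previous split reads it through),
  // and the row straddling 'end' is read to completion.
  static void readSplit(RandomAccessFile file, long start, long end)
      throws IOException {
    file.seek(start);
    if (start != 0) {
      file.readLine(); // discard the (possibly torn) row at 'start'
    }
    while (file.getFilePointer() <= end) {
      String row = file.readLine();
      if (row == null) {
        break; // EOF
      }
      System.out.println(row); // stand-in for actual row processing
    }
  }
}
{code}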

A test is added that passes on CliDriver and LlapLocal. More testing with other 
test drivers may be needed.

cc [~gopalv] [~prasanth_j] [~sseth]

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15147.01.WIP.noout.patch, 
> HIVE-15147.02.WIP.noout.patch, HIVE-15147.04.WIP.noout.patch, 
> HIVE-15147.05.WIP.noout.patch, HIVE-15147.WIP.noout.patch, HIVE-15147.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded 
> cache that is columnar thanks to the ORC file structure, we will transform the 
> data into columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> InputFormat and serde, as well as an ORC writer (potentially with some 
> heavyweight optimizations disabled), we can "uncompress" the csv/whatever data 
> into its "original" ORC representation, then cache it efficiently, by column, 
> and also reuse a lot of the existing code. A sketch of this re-encoding step 
> follows below.
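> To illustrate the re-encoding step (a sketch only - the actual code would use 
> the table's original InputFormat/serde and feed the LLAP cache rather than a 
> file; the two-string-column schema, the pre-parsed Iterable<String[]> input, 
> and all names here are assumptions for the example):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
> import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
> import org.apache.orc.OrcFile;
> import org.apache.orc.TypeDescription;
> import org.apache.orc.Writer;
>
> public class TextToOrcSketch {
>   // "Uncompresses" already-parsed text rows into ORC so that the data
>   // can be cached, and later read, by column.
>   static void encode(Iterable<String[]> rows, Path out, Configuration conf)
>       throws java.io.IOException {
>     TypeDescription schema =
>         TypeDescription.fromString("struct<a:string,b:string>");
>     Writer writer = OrcFile.createWriter(out,
>         OrcFile.writerOptions(conf).setSchema(schema));
>     VectorizedRowBatch batch = schema.createRowBatch();
>     for (String[] row : rows) {
>       int r = batch.size++;
>       for (int c = 0; c < row.length; ++c) {
>         ((BytesColumnVector) batch.cols[c]).setVal(
>             r, row[c].getBytes(StandardCharsets.UTF_8));
>       }
>       if (batch.size == batch.getMaxSize()) {
>         writer.addRowBatch(batch);
>         batch.reset();
>       }
>     }
>     if (batch.size > 0) {
>       writer.addRowBatch(batch);
>     }
>     writer.close();
>   }
> }
> {code}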
> Various other points:
> 1) Caching granularity will have to be determined somehow (i.e. how we slice 
> the file horizontally, to avoid caching entire columns). As with uncompressed 
> ORC files, the specific offsets don't really matter as long as they are 
> consistent between reads. The problem is that the file offsets will actually 
> need to be propagated to the new reader from the original InputFormat. Row 
> counts are easier to use, but then there's the problem of how to map them to 
> the missing ranges that have to be read from disk.
> 2) Obviously, for row-based formats, if any one column to be read has been 
> evicted or is otherwise missing, all the columns of the corresponding slice 
> have to be read in order to cache and serve that one column. The vague plan is 
> to handle this implicitly, similarly to how the ORC reader handles CB-RG 
> overlaps: a missing column in the disk-range list to retrieve will just expand 
> the disk range to read into the whole horizontal slice of the file (see the 
> sketch after this list).
> 3) Slicing granularity won't work for gzipped text, since gzip streams are not 
> splittable: if anything at all is evicted, the entire file has to be re-read. 
> Gzipped text is a ridiculous feature, so this is by design.
> 4) In the future, it would also be possible to build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of scope 
> for now.
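> For point 2, the slice-expansion logic amounts to something like the following 
> (an illustrative sketch; the data structures and names are not from the patch):
> {code:java}
> import java.util.ArrayList;
> import java.util.BitSet;
> import java.util.List;
>
> public class SliceReadPlanner {
>   // cached[slice] marks which columns of that horizontal slice are in
>   // the cache; returns the slices that must be re-read in full, since
>   // a row-based format interleaves all columns on disk.
>   static List<Integer> slicesToRead(BitSet[] cached, int[] wantedCols) {
>     List<Integer> toRead = new ArrayList<>();
>     for (int slice = 0; slice < cached.length; ++slice) {
>       for (int col : wantedCols) {
>         if (!cached[slice].get(col)) {
>           toRead.add(slice); // one missing column expands the read
>           break;             // to the whole slice
>         }
>       }
>     }
>     return toRead;
>   }
> }
> {code}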


