[ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Summary: LLAP: use LLAP cache for non-columnar formats in a somewhat general way  (was: LLAP: support cache for non-columnar formats in a somewhat general way)

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> The primary target for the first pass is caching text formats. Nothing would
> prevent other formats from using the same path, in principle, although, as
> was originally done with ORC, it may be better to have native caching support
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded
> cache that is columnar due to the ORC file structure, we will try to reuse that.
> The general idea is to treat all the data in the world as merely ORC that was
> compressed with some poor compression codec, such as CSV. Using the original
> InputFormat and SerDe, as well as the ORC writer (with some heavyweight
> optimizations removed, potentially), we can "uncompress" the data into
> "original" ORC, then reuse a lot of the existing code.
> Various other points:
> 1) Granularity in the file will have to be determined somehow (horizontal
> slicing of the file, to avoid caching entire columns). We can base it on
> arbitrary disk offsets determined during reading, but they will actually have
> to be propagated to the reader from the original InputFormat. Row counts are
> easier to use, but there's a problem of how to actually map them to the
> missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one needed column is evicted,
> "all the columns" have to be read for the corresponding slice. The vague plan
> is to handle this implicitly, similarly to how the ORC reader handles CB-RG
> overlaps - it will just so happen that a missing column will expand the
> disk-range-to-read into the whole horizontal slice of the file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of
> metadata/indexes for this cached data to do predicate pushdown (PPD), etc.
> This is out of the scope of this stage.
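
To make points 2) and 3) concrete, here is a minimal sketch of the eviction rule, not the actual LLAP implementation. It assumes the cache can be modeled as a set of (slice index, column index) pairs. The names `slices_to_reread` and `populate`, and the `splittable` flag standing in for "not gzipped", are hypothetical and do not come from the Hive code base.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical cache layout: one entry per (horizontal slice, column).
CacheKey = Tuple[int, int]


def populate(cached: Set[CacheKey], slice_idx: int, num_columns: int) -> None:
    """Record that every column of one slice is cached.

    Re-reading a slice of a row-based file decodes all of its columns,
    so all of them can be cached at once.
    """
    cached.update((slice_idx, c) for c in range(num_columns))


def slices_to_reread(
    cached: Set[CacheKey],
    num_slices: int,
    wanted_columns: Iterable[int],
    splittable: bool = True,
) -> List[int]:
    """Return the slices that must be read from disk again.

    A slice is re-read in full if any wanted column is missing from it
    (point 2). If the file cannot be sliced, e.g. gzipped text, a single
    miss forces a re-read of every slice (point 3).
    """
    wanted = list(wanted_columns)
    missing = [
        s for s in range(num_slices)
        if any((s, c) not in cached for c in wanted)
    ]
    if missing and not splittable:
        return list(range(num_slices))
    return missing


# Usage: 3 slices of 4 columns, then column 2 of slice 1 gets evicted.
cache: Set[CacheKey] = set()
for s in range(3):
    populate(cache, s, num_columns=4)
cache.discard((1, 2))

print(slices_to_reread(cache, 3, [0, 2]))                    # [1]
print(slices_to_reread(cache, 3, [0, 1]))                    # []
print(slices_to_reread(cache, 3, [0, 2], splittable=False))  # [0, 1, 2]
```

The sketch uses slice indexes for simplicity. Point 1) leaves open whether slices would be identified by disk offsets or by row counts, and that choice is not modeled here.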