[ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Attachment:     (was: HIVE-15147.02.WIP.noout.patch)

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15147.01.patch, HIVE-15147.WIP.noout.patch,
>                      HIVE-15147.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would
> prevent other formats from using the same path, in principle, although, as
> was originally done with ORC, it may be better to have native caching support
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded
> cache that is columnar due to ORC file structure, we will transform data into
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was
> compressed with some poor compression codec, such as csv. Using the original
> IF and serde, as well as an ORC writer (with some heavyweight optimizations
> disabled, potentially), we can "uncompress" the csv/whatever data into its
> "original" ORC representation, then cache it efficiently, by column, and also
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we
> slice the file horizontally, to avoid caching entire columns). As with ORC
> uncompressed files, the specific offsets don't really matter as long as they
> are consistent between reads. The problem is that the file offsets will
> actually need to be propagated to the new reader from the original
> inputformat. Row counts are easier to use but there's a problem of how to
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has
> been evicted or is otherwise missing, "all the columns" have to be read for
> the corresponding slice to cache and read that one column. The vague plan is
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps
> - it will just so happen that a missing column in disk range list to retrieve
> will expand the disk-range-to-read into the whole horizontal slice of the
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of
> metadata/indexes for this cached data to do PPD, etc. This is out of the
> scope for now.
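
To make points 1-3 above more concrete, here is a minimal sketch in Python of how a missing column could expand into a read of the whole horizontal slice. It is an illustration only, not code from the patch: the names `Slice`, `plan_reads` and `merge_adjacent` are hypothetical, the cache is modeled as a plain dict from slice to the set of cached column indexes, and slices are assumed to be identified by byte offsets that stay consistent between reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """Hypothetical horizontal slice of a row-based file, by byte offsets."""
    start: int  # inclusive
    end: int    # exclusive


def merge_adjacent(ranges):
    """Coalesce byte ranges that touch, so neighbouring slices become one read."""
    merged = []
    for start, end in sorted(ranges):
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def plan_reads(slices, cached, wanted_columns):
    """Return the byte ranges that have to be re-read from disk.

    `cached` maps a Slice to the set of column indexes currently cached.
    The format is row-based, so a single column cannot be read on its own:
    if any wanted column is missing, the whole slice is read again.
    """
    wanted = set(wanted_columns)
    to_read = []
    for s in slices:
        if not wanted <= cached.get(s, set()):
            to_read.append((s.start, s.end))
    return merge_adjacent(to_read)


# Three slices; column 1 was evicted from the second, column 0 from the third.
slices = [Slice(0, 100), Slice(100, 200), Slice(200, 300)]
cached = {
    Slice(0, 100): {0, 1, 2},
    Slice(100, 200): {0, 2},
    Slice(200, 300): {1},
}
print(plan_reads(slices, cached, [0, 1]))  # [(100, 300)]
```

Under the same assumptions, the gzipped case from point 3 degenerates to a single slice covering the whole file, so any missing column yields one range spanning the entire file.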