[ 
https://issues.apache.org/jira/browse/HIVE-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694343#comment-14694343
 ] 

Sergey Shelukhin edited comment on HIVE-11500 at 8/12/15 10:44 PM:
-------------------------------------------------------------------

On the ACID incompatibility - we cannot exclude stripes from the read if delta 
files are present, because delta files can modify the values of the columns 
that the predicate acts upon, which invalidates the stripe statistics. So I 
guess it's compatible with ACID only as long as there are no delta files at 
the time of the read.
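
To make the rule above concrete, here is a minimal sketch of the guard, not the actual Hive/ORC reader code. The StripeStats class, the single long-valued range predicate, and the hasDeltaFiles flag are all hypothetical names made up for illustration; the only point is that statistics-based pruning is skipped whenever deltas exist.

```java
import java.util.List;

final class StripePruningSketch {
    /** Hypothetical min/max statistics for the predicate column in one stripe. */
    static final class StripeStats {
        final long min;
        final long max;
        StripeStats(long min, long max) { this.min = min; this.max = max; }
    }

    /** True if the stripe may contain a value in [lower, upper] according to its stats. */
    static boolean mayMatch(StripeStats s, long lower, long upper) {
        return s.max >= lower && s.min <= upper;
    }

    /**
     * Returns one include flag per stripe. If delta files are present the stats
     * cannot be trusted (deltas may have changed the column values), so every
     * stripe is read; otherwise stripes whose stats rule out the range are skipped.
     */
    static boolean[] selectStripes(List<StripeStats> stripes, long lower, long upper,
                                   boolean hasDeltaFiles) {
        boolean[] include = new boolean[stripes.size()];
        for (int i = 0; i < include.length; i++) {
            include[i] = hasDeltaFiles || mayMatch(stripes.get(i), lower, upper);
        }
        return include;
    }
}
```
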
On cache cleaning - an entry keyed by fileId is stale if the file is missing :) 
Compactor can explicitly remove entries; which thread does the background task 
is an implementation detail. One thing about the lazy cleanup though is that, 
ideally, it should not wake up every N minutes and examine the entire cache, 
checking all the files; it should run continuously with a (non-binding) goal 
of examining the entire cache every N minutes, with the work spread over the 
entire N-minute interval. So that may require a separate thread.
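
A rough sketch of the "spread over the interval" cleanup, under some assumptions: the cache is modeled as an in-memory sorted map keyed by fileId (the real one would live in the HBase metastore), fileExists stands in for the FS check, and IncrementalCacheCleaner, tick and ticksPerSweep are hypothetical names. Each tick examines about 1/ticksPerSweep of the cache and remembers where it stopped, so a full pass takes roughly N minutes without ever scanning everything at once.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;

final class IncrementalCacheCleaner {
    private final ConcurrentSkipListMap<Long, byte[]> cache; // fileId -> cached footer
    private final LongPredicate fileExists;                  // stand-in for the FS check
    private final int ticksPerSweep;
    private Long cursor = null; // last fileId examined; null means "start a new pass"

    IncrementalCacheCleaner(ConcurrentSkipListMap<Long, byte[]> cache,
                            LongPredicate fileExists, int ticksPerSweep) {
        this.cache = cache;
        this.fileExists = fileExists;
        this.ticksPerSweep = ticksPerSweep;
    }

    /** Examines roughly 1/ticksPerSweep of the cache; returns how many entries were removed. */
    synchronized int tick() {
        int budget = (cache.size() + ticksPerSweep - 1) / ticksPerSweep; // ceiling division
        int removed = 0;
        for (int i = 0; i < budget; i++) {
            Long next = (cursor == null) ? cache.ceilingKey(Long.MIN_VALUE)
                                         : cache.higherKey(cursor);
            if (next == null) { // reached the end; the next tick starts a new pass
                cursor = null;
                break;
            }
            cursor = next;
            if (!fileExists.test(next)) { // file is gone, so the entry is stale
                cache.remove(next);
                removed++;
            }
        }
        return removed;
    }

    /** Runs tick() on its own thread, aiming (non-binding) at one full pass per sweepMinutes. */
    static ScheduledExecutorService start(IncrementalCacheCleaner cleaner,
                                          long sweepMinutes, int ticksPerSweep) {
        long delayMs = Math.max(1, TimeUnit.MINUTES.toMillis(sweepMinutes) / ticksPerSweep);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleWithFixedDelay(cleaner::tick, delayMs, delayMs, TimeUnit.MILLISECONDS);
        return pool;
    }
}
```

The goal stays non-binding here: entries added behind the cursor simply wait for the next pass, and explicit removal by the compactor can go straight to the map without involving the cleaner.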


> implement file footer / splits cache in HBase metastore
> -------------------------------------------------------
>
>                 Key: HIVE-11500
>                 URL: https://issues.apache.org/jira/browse/HIVE-11500
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HBase metastore split cache.pdf
>
>
> We need to cache file metadata (e.g. ORC file footers) for split generation 
> (which, on FSes that support fileId, will be valid permanently and only needs 
> to be removed lazily when the ORC file is erased or compacted), and 
> potentially even some information about splits (e.g. grouping based on 
> location that would be good for some short time), in HBase metastore.
> -It should be queryable by table. Partition predicate pushdown should be 
> supported. If bucket pruning is added, that too.- Given that we cannot cache 
> file lists (we have to check FS for new/changed files anyway), and that 
> passing data about partitions/etc. to split generation is more difficult 
> than passing paths, we will probably just filter by paths and fileIds. It 
> might be different for splits.
> In later phases, it would be nice to save the (first category above) results 
> of expensive work done by jobs, e.g. data size after decompression/decoding 
> per column, etc. to avoid surprises when ORC encoding is very good, or very 
> bad. Perhaps it can even be lazily generated. Here's a pony: 🐴



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
