[jira] [Commented] (HIVE-860) Persistent distributed cache

Gunther Hagleitner (JIRA) Tue, 18 Feb 2014 12:20:47 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904532#comment-13904532
 ]


Gunther Hagleitner commented on HIVE-860:
-----------------------------------------

Can we keep a single bundle for the Hive internal pieces? I think that's 
orthogonal to the caching effort, and it seems more efficient to me than 
breaking it all into smaller bits. It also lets us shade what needs shading, 
and it doesn't change how we handle these things as drastically. 

It seems that in Pig they made the caching optional. Can we do that too, in 
case someone has issues with caching in the user directory? 

Finally, a thought for file formats: it would be nice to pull the 
dependencies only when they are actually needed, not every time you run a 
query. That way you're not penalized for adding as many as you want, and 
external serdes can play too. We could extend the serde API with an optional 
call to retrieve additional jars to be localized.
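
To make the last suggestion concrete, here is a rough sketch of what such an 
optional call might look like. Everything in it is hypothetical: the 
AdditionalJarProvider interface, the getAdditionalJars() method, and the jar 
paths are made up for illustration and are not part of Hive's serde API. The 
idea is only that a serde which needs nothing extra inherits an empty 
default, and the planner collects jars just for the serdes a query touches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SerdeJarSketch {
    /** Hypothetical optional hook; NOT an existing Hive interface. */
    interface AdditionalJarProvider {
        /** Jars to localize when this serde is used; empty by default. */
        default List<String> getAdditionalJars() {
            return Collections.emptyList();
        }
    }

    /** A serde with no extra dependencies keeps the default. */
    static class PlainSerde implements AdditionalJarProvider {}

    /** An external serde that declares its own dependencies (made-up paths). */
    static class ExternalSerde implements AdditionalJarProvider {
        @Override
        public List<String> getAdditionalJars() {
            return Arrays.asList("/lib/example-format.jar", "/lib/example-codec.jar");
        }
    }

    /** Collects, without duplicates, the jars needed by the serdes in one query. */
    static List<String> jarsToLocalize(List<? extends AdditionalJarProvider> serdesInQuery) {
        Set<String> jars = new LinkedHashSet<>();
        for (AdditionalJarProvider serde : serdesInQuery) {
            jars.addAll(serde.getAdditionalJars());
        }
        return new ArrayList<>(jars);
    }
}
```

With this shape, a query that only uses PlainSerde localizes nothing extra, 
and existing serdes would not need to change.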

> Persistent distributed cache
> ----------------------------
>
>                 Key: HIVE-860
>                 URL: https://issues.apache.org/jira/browse/HIVE-860
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, 
> HIVE-860.patch, HIVE-860.patch
>
>
> DistributedCache is shared across multiple jobs if the HDFS file name is the 
> same.
> We need to make sure Hive puts the same file into the same location every 
> time and does not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session 
> will have a single copy in distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single 
> copy in distributed cache.
> A2 has a bigger benefit in sharing but may raise a question on when Hive 
> should clean it up in HDFS.
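
One way to picture the A2 scheme from the description above is a cache path 
derived only from the file's name, timestamp, and md5, so the same file 
always maps to the same location regardless of session. This is a minimal 
sketch under that assumption, not what the attached patches do: the 
CachePathSketch class, the cachePath() helper, and the directory layout are 
hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CachePathSketch {
    /** Returns the lowercase hex md5 of the given bytes. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Hypothetical layout: <cacheRoot>/<md5>-<modTime>/<fileName>.
     * No session id is involved, so identical files share one location.
     */
    static String cachePath(String cacheRoot, String fileName, long modTime, byte[] content) {
        return cacheRoot + "/" + md5Hex(content) + "-" + modTime + "/" + fileName;
    }
}
```

A caller would check whether the computed path already exists and skip the 
upload if it does. The cleanup question raised for A2 is not addressed by 
this sketch.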



