[ https://issues.apache.org/jira/browse/HIVE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904532#comment-13904532 ]
Gunther Hagleitner commented on HIVE-860:
-----------------------------------------

Can we keep a single bundle for the Hive internal pieces? I think that's orthogonal to the caching effort, and it seems more efficient to me than breaking it all into smaller bits. It also lets us shade what needs shading, and it doesn't change how we handle these things as drastically.

It seems that in Pig they made the caching optional. Can we do that too, in case someone has issues with caching in the user directory?

Finally, a thought on file formats: it would be nice to pull the dependencies only when they are actually needed, not every time you run a query. That way you're not penalized for adding as many as you want, and external SerDes can play too. We could extend the SerDe API with an optional call to retrieve additional jars to be localized.

> Persistent distributed cache
> ----------------------------
>
>                 Key: HIVE-860
>                 URL: https://issues.apache.org/jira/browse/HIVE-860
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, HIVE-860.patch
>
>
> DistributedCache is shared across multiple jobs if the HDFS file name is the same.
> We need to make sure Hive puts the same file into the same location every time and does not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session will have a single copy in the distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single copy in the distributed cache.
> A2 has a bigger benefit in sharing but may raise the question of when Hive should clean it up in HDFS.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
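
As a rough illustration of option A2 in the quoted description: one way to make Hive put the same file into the same location every time is to derive the HDFS path from the file name, timestamp, and md5. The sketch below is not taken from any HIVE-860 patch. The class name, method names, base directory, and path layout are all assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Hive: builds a deterministic cache
// location from (name, timestamp, md5) as described in option A2.
final class CachePathSketch {

    // Hex-encode the MD5 digest of the file content.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumed layout: <baseDir>/<md5>/<timestamp>/<name>.
    // Identical inputs always map to the same path, so a second upload
    // of the same file can be skipped if the path already exists.
    static String cachePath(String baseDir, String name,
                            long timestamp, byte[] content) {
        return baseDir + "/" + md5Hex(content) + "/" + timestamp + "/" + name;
    }

    public static void main(String[] args) {
        byte[] jar = "fake jar bytes".getBytes(StandardCharsets.UTF_8);
        String first = cachePath("/tmp/hive-cache", "udf.jar", 1000L, jar);
        String second = cachePath("/tmp/hive-cache", "udf.jar", 1000L, jar);
        System.out.println(first);
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

Because the path depends only on the three attributes, it is the same across sessions, which is what gives A2 its sharing benefit. It also leaves the cleanup question from the description open, since nothing in the path records which sessions still use the file.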