[ https://issues.apache.org/jira/browse/HIVE-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028046#comment-14028046 ]
Mithun Radhakrishnan commented on HIVE-7195:
--------------------------------------------

I've been trying to solve the problem from the other end in HCatalog, i.e. registering partitions in the metastore for data that was written to HDFS outside of Hive/HCatalog (e.g. through an ingestion service like Apache Falcon).

There were several points at which I wished we had an abstraction for a "partition-spec" at the metastore level (if not at the ObjectStore level). It would be cool to have parallel functions like the following in the HiveMetaStore(Client) interface:

{code}
public PartitionSpec listPartitions(db_name, tbl_name, max_parts) throws ... ;
public int add_partitions( PartitionSpec new_parts ) throws ... ;
{code}

where the PartitionSpec looks like:

{code}
public interface PartitionSpec {
  public List<Partition> getPartitions();
  public List<String> getPartNames();
  public Iterator<Partition> getPartitionIter();
  public Iterator<String> getPartNameIter();
}
{code}

The DefaultPartitionSpec composes a List<Partition>.

An HDFSDirBasedPartitionSpec could be implemented to store a root-level partition-dir, and return Partition objects via globStatus() on HDFS. I would use this as an argument to addPartitions(PartitionSpec), to avoid having to specify all partitions explicitly. This avoids a bunch of thrift-serialization and traffic over the wire.

A future PartitionSpec could choose to compose other PartitionSpecs. HiveMetaStoreClient.listPartitions() could choose to return a PartitionSpec that composes several Partition objects that use the same StorageDescriptor instance, so that 10000 partitions with nearly the same SD don't repeat the redundant bits.

I haven't worked out the nuts-and-bolts completely. I'll put a more complete proposal out on a separate JIRA. I think this will have value for both listPartitions() (i.e. read) and addPartitions() (i.e. write). I'd value your opinion on the approach.

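A minimal, self-contained sketch of the PartitionSpec idea above, for illustration only. It assumes a stripped-down stand-in for the metastore's Partition class (the real one is Thrift-generated and carries a StorageDescriptor, among other fields). The Partition stand-in, its two fields, and the constructor shapes are hypothetical; only the four PartitionSpec methods and the notion of a DefaultPartitionSpec composing a List<Partition> come from the proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the metastore's Partition; not the real Hive class.
final class Partition {
    private final String name;      // e.g. "dt=2014-06-11"
    private final String location;  // e.g. a directory on HDFS

    Partition(String name, String location) {
        this.name = name;
        this.location = location;
    }

    String getName() { return name; }
    String getLocation() { return location; }
}

// The four methods proposed in the comment.
interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Composes an explicit List<Partition>.
final class DefaultPartitionSpec implements PartitionSpec {
    private final List<Partition> partitions;

    DefaultPartitionSpec(List<Partition> partitions) {
        // Defensive copy so later changes to the caller's list are not visible here.
        this.partitions = Collections.unmodifiableList(new ArrayList<>(partitions));
    }

    @Override
    public List<Partition> getPartitions() { return partitions; }

    @Override
    public List<String> getPartNames() {
        List<String> names = new ArrayList<>(partitions.size());
        for (Partition p : partitions) {
            names.add(p.getName());
        }
        return names;
    }

    @Override
    public Iterator<Partition> getPartitionIter() { return partitions.iterator(); }

    @Override
    public Iterator<String> getPartNameIter() { return getPartNames().iterator(); }
}
```

An HDFSDirBasedPartitionSpec would implement the same interface but hold only the root directory and produce Partition objects lazily from the iterator methods, which is why the interface exposes iterators alongside the list getters.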
> Improve Metastore performance
> -----------------------------
>
>                 Key: HIVE-7195
>                 URL: https://issues.apache.org/jira/browse/HIVE-7195
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Brock Noland
>            Priority: Critical
>
> Even with direct SQL, which significantly improves MS performance, some
> operations take a considerable amount of time, when there are many partitions
> on table. Specifically I believe the issue:
> * When a client gets all partitions we do not send them an iterator, we
> create a collection of all data and then pass the object over the network in
> total
> * Operations which require looking up data on the NN can still be slow since
> there is no cache of information and it's done in a serial fashion
> * Perhaps a tangent, but our client timeout is quite dumb. The client will
> timeout and the server has no idea the client is gone. We should use
> deadlines, i.e. pass the timeout to the server so it can calculate that the
> client has expired.

--
This message was sent by Atlassian JIRA
(v6.2#6252)