[
https://issues.apache.org/jira/browse/HIVE-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028046#comment-14028046
]
Mithun Radhakrishnan commented on HIVE-7195:
--------------------------------------------
I've been trying to solve the problem from the other end in HCatalog, i.e.
registering partitions in the metastore for data that was written to HDFS
outside of Hive/HCatalog (e.g. through an ingestion service like Apache
Falcon). There were several points at which I wished we had an abstraction for
a "partition-spec" at the metastore level (if not at the ObjectStore level).
It would be cool to have parallel functions like the following in the
HiveMetaStore(Client) interface:
{code}
public PartitionSpec listPartitions(String db_name, String tbl_name, short max_parts) throws ... ;
public int add_partitions(PartitionSpec new_parts) throws ... ;
{code}
where the PartitionSpec looks like:
{code}
public interface PartitionSpec {
  public List<Partition> getPartitions();
  public List<String> getPartNames();
  public Iterator<Partition> getPartitionIter();
  public Iterator<String> getPartNameIter();
}
{code}
The DefaultPartitionSpec composes a List<Partition>.
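To make the default case concrete, here is a minimal sketch of how a DefaultPartitionSpec might wrap a List<Partition>. The Partition class below is a hypothetical, two-field stand-in for the metastore's Thrift-generated class, and the constructor and field names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the metastore's Partition class,
// reduced to the two fields this sketch needs.
final class Partition {
    private final String name;      // e.g. "dt=2014-06-11"
    private final String location;  // e.g. "/data/tbl/dt=2014-06-11"

    Partition(String name, String location) {
        this.name = name;
        this.location = location;
    }

    String getName() { return name; }
    String getLocation() { return location; }
}

// Same shape as the interface proposed above.
interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Simplest implementation: composes an explicit list of partitions.
final class DefaultPartitionSpec implements PartitionSpec {
    private final List<Partition> partitions;

    DefaultPartitionSpec(List<Partition> partitions) {
        // Defensive copy, exposed read-only.
        this.partitions = Collections.unmodifiableList(new ArrayList<>(partitions));
    }

    @Override
    public List<Partition> getPartitions() { return partitions; }

    @Override
    public List<String> getPartNames() {
        List<String> names = new ArrayList<>(partitions.size());
        for (Partition p : partitions) {
            names.add(p.getName());
        }
        return names;
    }

    @Override
    public Iterator<Partition> getPartitionIter() { return partitions.iterator(); }

    @Override
    public Iterator<String> getPartNameIter() { return getPartNames().iterator(); }
}
```

This variant buys nothing in terms of wire traffic; it only gives existing List<Partition> callers a way into the same interface.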
An HDFSDirBasedPartitionSpec could be implemented to store a root-level
partition-dir, and return Partition objects via globStatus() on HDFS. I would
use this as an argument to addPartitions(PartitionSpec), to avoid having to
specify all partitions explicitly. Only the root dir needs to be sent, which
avoids Thrift-serializing every partition and the traffic over the wire.
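As a rough illustration of the directory-based idea, the sketch below derives partition names from the sub-directories of a root dir. To stay self-contained it uses java.nio.file against the local filesystem instead of Hadoop's FileSystem.globStatus(); the class name LocalDirBasedPartitionSpec and the "*=*" glob are assumptions for this example, not part of any existing API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local-filesystem analogue of an HDFSDirBasedPartitionSpec.
// Only the root dir is stored; partitions are discovered on demand.
final class LocalDirBasedPartitionSpec {
    private final Path rootDir;

    LocalDirBasedPartitionSpec(Path rootDir) {
        this.rootDir = rootDir;
    }

    // The only state a client would need to send to the metastore.
    Path getRootDir() { return rootDir; }

    // Lists sub-directories that look like "key=value" partition dirs.
    List<String> getPartNames() {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(rootDir, "*=*")) {
            for (Path p : dirs) {
                if (Files.isDirectory(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);  // directory listing order is unspecified
        return names;
    }
}
```

In the real thing, the globbing would presumably happen on the metastore side, so the client ships one path rather than thousands of Partition objects.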
A future PartitionSpec could choose to compose other PartitionSpecs.
HiveMetaStoreClient.listPartitions() could choose to return a PartitionSpec
that composes several Partition objects that use the same StorageDescriptor
instance, so that 10000 partitions with nearly the same SD don't repeat the
redundant bits.
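Here is a minimal sketch of the shared-StorageDescriptor idea, assuming all partitions live directly under one root path and differ only by name. StorageDescriptor and PartitionView are hypothetical, stripped-down stand-ins for the real metastore classes, and SharedSDPartitionSpec is an illustrative name only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the metastore's StorageDescriptor.
final class StorageDescriptor {
    final String inputFormat;
    final String serde;

    StorageDescriptor(String inputFormat, String serde) {
        this.inputFormat = inputFormat;
        this.serde = serde;
    }
}

// Hypothetical lightweight partition: holds a reference to the shared SD.
final class PartitionView {
    final String name;
    final String location;
    final StorageDescriptor sd;

    PartitionView(String name, String location, StorageDescriptor sd) {
        this.name = name;
        this.location = location;
        this.sd = sd;
    }
}

// Stores the SD once, plus only what differs per partition (here, the name).
final class SharedSDPartitionSpec {
    private final String rootPath;
    private final StorageDescriptor sharedSd;
    private final List<String> partNames;

    SharedSDPartitionSpec(String rootPath, StorageDescriptor sharedSd, List<String> partNames) {
        this.rootPath = rootPath;
        this.sharedSd = sharedSd;
        this.partNames = new ArrayList<>(partNames);
    }

    // Materializes partitions lazily; each one points at the same SD instance.
    Iterator<PartitionView> getPartitionIter() {
        final Iterator<String> names = partNames.iterator();
        return new Iterator<PartitionView>() {
            @Override
            public boolean hasNext() { return names.hasNext(); }

            @Override
            public PartitionView next() {
                String name = names.next();
                return new PartitionView(name, rootPath + "/" + name, sharedSd);
            }
        };
    }
}
```

With this layout, the serialized size grows with the partition names rather than with 10000 copies of the SD.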
I haven't worked out the nuts-and-bolts completely. I'll put a more complete
proposal out on a separate JIRA. I think this will have value for both
listPartitions() (i.e. read) and addPartitions() (i.e. write). I'd value your
opinion on the approach.
> Improve Metastore performance
> -----------------------------
>
> Key: HIVE-7195
> URL: https://issues.apache.org/jira/browse/HIVE-7195
> Project: Hive
> Issue Type: Improvement
> Reporter: Brock Noland
> Priority: Critical
>
> Even with direct SQL, which significantly improves MS performance, some
> operations take a considerable amount of time when there are many partitions
> on a table. Specifically, I believe the issues are:
> * When a client gets all partitions we do not send them an iterator; we
> create a collection of all the data and then pass the whole object over the
> network
> * Operations which require looking up data on the NN can still be slow since
> there is no cache of information and it's done in a serial fashion
> * Perhaps a tangent, but our client timeout is quite dumb. The client will
> time out and the server has no idea the client is gone. We should use
> deadlines, i.e. pass the timeout to the server so it can calculate that the
> client has expired.