Re: Bad performance of Hive meta store for tables with large number of partitions

Benyi Wang Fri, 30 Nov 2012 12:24:34 -0800

Does JDO support the features in EclipseLink in this article?

http://java-persistence-performance.blogspot.com/2010/08/batch-fetching-optimizing-object-graph.html



On Fri, Nov 30, 2012 at 11:36 AM, Benyi Wang <bewang.t...@gmail.com> wrote:

> We have some tables with 15K ~ 20K partitions. If I run a query scanning a
> lot of partitions, Hive could use more than 10 minutes to commit the mapred
> job.
>
> The problem is caused by ObjectStore.getPartitionsByNames when Hive
> semantic analyzer tries to prune partitions. This method sends a lot of
> queries to our MySQL database to retrieve ALL information about partitions.
> Because MPartition and MStroageDescriptor are converted into Partition and
> StorageDescriptor, every field will be accessed during conversion, in other
> words, even the fields has nothing to do with partition pruning, such as
> BucketCols. In our case, 10 queries for each partition will be sent to the
> database and each query may take 40ms.
>
> This is known ORM 1+N problem. But it is really bad user experience.
>
> Actually we assembly Partition objects manually, it would only need about
> 10 queries for a group of partitions (default size is 300). In our
> environment, it only needs 40 seconds for 30K partitions: 30K / 300 * 10 *
> 40.
>
> I tried to this way:
>
> 1. Fetch MPartition with fetch group and fetch_size_greedy, so one query
> can get MPartition's primary fields and MStorageDescriptor cached.
> 2. Get all descriptors into a list "msds", run another query to get
> MStorageDescriptor with filter like "msds.contains(this)", all cached
> descriptors will be refreshed in one query instead of n queries.
>
> This works well for 1-1 relations, but not on 1-N relation like
> MPartition.values. I didn't find a way to populate those fields in just one
> query.
>
> Because JDO mapping doesn't work well in the conversion (MPartition ->
> Partition), I'm wondering if it is worth doing like this:
>
> 1. Query each table in SQL directly PARTITIONS, SDS, etcs.
> 2. Assembly Partition objects
>
> This is a hack and the code will be really bad. But I didn't find JDO
> support "FETCH JOIN" or "Batch fetch".
>
> Any thoughts?
>

Re: Bad performance of Hive meta store for tables with large number of partitions

Reply via email to