[jira] [Updated] (HIVE-2050) batch processing partition pruning process

Ning Zhang (JIRA) Sun, 27 Mar 2011 23:02:47 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ning Zhang updated HIVE-2050:
-----------------------------

    Attachment: HIVE-2050.2.patch

There are 2 major changes from the last patch:
 - Added a parameter, hive.metastore.batch.retrieve.max, to control the maximum 
number of partitions that can be retrieved from the metastore in one batch 
(default 300). In Hive.getPartitionsByNames(), the input list of partition 
names is split into sublists, and the metastore API is called once per sublist.
 - One of the most time-consuming DB operations is retrieving the sub-classes 
of MPartition. In particular, the list of FieldSchema is retrieved for each 
partition but never used (the table's field schema is used for all 
partitions). So one of the changes here is to skip the retrieval of the 
partition's FieldSchema and use the table's FieldSchema as the partition's. 
If we later need the partition's own FieldSchema for schema evaluation, we 
should add another function/flag for that. 
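
The batching described in the first item can be sketched roughly as follows. 
This is a minimal illustration, not the code in HIVE-2050.2.patch: the class 
PartitionNameBatcher, its methods, and the fetchBatch callback (standing in 
for the metastore call) are all hypothetical names, and batchMax plays the 
role of hive.metastore.batch.retrieve.max.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of the Hive API.
final class PartitionNameBatcher {

    // Splits the name list into consecutive sublists of at most batchMax names.
    static List<List<String>> split(List<String> names, int batchMax) {
        if (batchMax <= 0) {
            throw new IllegalArgumentException("batchMax must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < names.size(); start += batchMax) {
            int end = Math.min(start + batchMax, names.size());
            // Copy the view so each batch is independent of the input list.
            batches.add(new ArrayList<>(names.subList(start, end)));
        }
        return batches;
    }

    // Calls fetchBatch once per sublist and concatenates the results in order.
    // fetchBatch is an assumed stand-in for the per-batch metastore call.
    static <P> List<P> fetchAll(List<String> names, int batchMax,
                                Function<List<String>, List<P>> fetchBatch) {
        List<P> result = new ArrayList<>();
        for (List<String> batch : split(names, batchMax)) {
            result.addAll(fetchBatch.apply(batch));
        }
        return result;
    }
}
```

With 7 names and batchMax set to 3, split() yields sublists of sizes 3, 3 and 
1, so the callback is invoked three times.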

These changes reduce memory usage by 50% and CPU usage by 20%. 

The review board is also updated with the Java-only patch. 

> batch processing partition pruning process
> ------------------------------------------
>
>                 Key: HIVE-2050
>                 URL: https://issues.apache.org/jira/browse/HIVE-2050
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-2050.2.patch, HIVE-2050.patch
>
>
> For partition predicates that cannot be pushed down to JDO filtering 
> (HIVE-2049), we should fall back to the old approach of listing all partition 
> names first and using Hive's expression evaluation engine to select the 
> correct partitions. The partition pruner should then hand Hive a list of 
> partition names and get back a list of Partition objects (this should be 
> added to the Hive API). 
> A possible optimization is for the partition pruner to give Hive a set of 
> ranges of partition names (say [ts=01, ts=11], [ts=20, ts=24]), with the JDO 
> query formulated as range queries. Range queries are possible because the 
> first step lists all partition names in sorted order. It is then easy to 
> come up with the ranges, and the JDO range query results are guaranteed to 
> be equivalent to those of the query with a list of partition names. 
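
One way the range idea in the description could work is sketched below, under 
assumptions: given all partition names in sorted order and the subset chosen 
by the pruner, collapse each run of consecutive selected names into a 
[low, high] pair. The class PartitionRanges and the method toRanges are 
hypothetical and do not exist in Hive; turning the pairs into JDO range 
filters is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of the Hive API or of any attached patch.
final class PartitionRanges {

    // sortedAllNames: every partition name of the table, in sorted order.
    // selected: the names that passed Hive's expression evaluation.
    // Returns inclusive {low, high} pairs, one per run of consecutive
    // selected names in the sorted listing.
    static List<String[]> toRanges(List<String> sortedAllNames,
                                   Set<String> selected) {
        List<String[]> ranges = new ArrayList<>();
        String low = null;
        String high = null;
        for (String name : sortedAllNames) {
            if (selected.contains(name)) {
                if (low == null) {
                    low = name;      // a new run starts here
                }
                high = name;         // extend the current run
            } else if (low != null) {
                ranges.add(new String[] {low, high});
                low = null;          // an unselected name closes the run
            }
        }
        if (low != null) {
            ranges.add(new String[] {low, high});
        }
        return ranges;
    }
}
```

Because no unselected partition name falls between low and high in the sorted 
listing, a range query over [low, high] matches the same partitions as the 
explicit name list for that run, provided the query compares names in the 
same order as the listing.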

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
