[
https://issues.apache.org/jira/browse/HIVE-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989084#comment-12989084
]
bharath v commented on HIVE-1938:
---------------------------------
I have a few basic ideas about join costs. Consider the case of the Common Join
(we can extend this to other joins), for which I propose a mathematical approach.
We can treat disk IO and communication cost as the major bottlenecks and
optimize the join based on these factors. The challenge here is predicting the
size of the intermediate tables (when joining more than 2 tables).
Suppose there are "N" machines in the Hive-Hadoop cluster.
Suppose we are joining table "A" on column "a" with table "B" on column
"b". Table A contains n1 rows and table B contains n2 rows. Suppose A.a has "d1"
distinct values and B.b has "d2" distinct values, with d1 <= d2 without loss
of generality. The cost computation of a single join can be divided into 3
phases:
1) Map phase:
Both tables are scanned from disk once, and we can assume this is done in
parallel on the N machines. So the total disk IO cost is estimated as
O((n1+n2)/N) (assuming a constant page size).
2) Shuffle phase:
The cost estimate for this phase depends on the type of join. In the
worst case every row gets transferred to a reducer on a different machine (other
than the one on which it resides), i.e. every row moves across the network.
Assuming some constant ping time (obtained by averaging), the cost is
O(k*(n1+n2)), where k is the cost per unit of data transferred. Actual shuffle
costs may be lower than this due to the "maximizing locality" effect in Hadoop
MR; I am just providing an upper bound, and we can surely improve on it
depending on the type of join.
For example, for a map-side join we add the cost of moving the smaller table
to all the mappers, O(k*N*n3), where n3 is the number of rows in the smaller table.
3) Reduce phase:
In each reducer we just do a simple nested-loop join. Assuming a UNIFORM
DISTRIBUTION of the data, each reducer call gets (n1/d1) rows of "A" and (n2/d2)
rows of "B", so the cost per call is O((n1/d1)*(n2/d2)), and there are at most
d1/N sequential reducer calls per machine. Multiplying by that factor, the total
cost is O((n1*n2)/(N*d2)).
The above cost assumes a uniform distribution of data per table. We can improve
this estimate by maintaining per-node histograms of statistics.
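The three phase costs above can be combined into a small cost function. A minimal sketch, assuming all costs are in arbitrary but comparable units and taking the worst-case shuffle bound (the constant factors hidden by the O-notation are dropped):

```python
def common_join_cost(n1, n2, d1, d2, N, k):
    """Estimate the cost of a Common Join from the three phases above.

    n1, n2 : row counts of tables A and B
    d1, d2 : distinct values of A.a and B.b (d1 <= d2 assumed)
    N      : number of machines in the cluster
    k      : cost per unit of data transferred over the network
    Page size is treated as constant, as in the text.
    """
    map_cost = (n1 + n2) / N            # parallel scan of both tables
    shuffle_cost = k * (n1 + n2)        # worst case: every row moves
    reduce_cost = (n1 * n2) / (N * d2)  # nested-loop join per key,
                                        # d1/N sequential reducer calls
    return map_cost + shuffle_cost + reduce_cost
```

For a map-side join the shuffle term would instead be k*N*n3 for a small table of n3 rows, as noted above.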
After computing the join cost we need to predict the size of the resulting
table, so that its statistics can be used to continue the above procedure for
the rest of the tables.
Estimating the result table size:
The resulting table will have O(n1*n2/d1) rows, assuming that A.a and B.b have
the same distribution of distinct values. This is where we can improve a great
deal by maintaining good statistics, so that we can predict the correct size of
the resulting table.
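The size estimate above, as a sketch (following the comment's formula and its same-distribution assumption; constant factors dropped):

```python
def estimated_join_rows(n1, n2, d1):
    """Estimated output size of A join B: O(n1*n2/d1) rows,
    assuming A.a and B.b share the same distribution of distinct
    values. The result feeds the cost estimate for the next join."""
    return (n1 * n2) / d1
```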
We can then follow a simple System R style dynamic-programming approach, where
in each iteration we keep the best plans according to the above cost formulas.
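The System R style dynamic programming over join orders can be sketched as follows. This is a toy illustration, not Hive code: the cost of a plan is simplified to the sum of its intermediate result sizes, and `out_size` is a hypothetical callback standing in for the size estimate above.

```python
from itertools import combinations

def best_join_order(sizes, out_size):
    """System R style dynamic programming over left-deep join orders.

    sizes    : dict mapping table name -> estimated row count
    out_size : f(left_rows, right_rows) -> estimated rows of the join
               (hypothetical stand-in for the size estimate above)
    Plan cost is modeled as the sum of intermediate result sizes, a
    common simplification; the per-phase formulas above could be
    plugged in instead.
    """
    tables = list(sizes)
    # best[S] = (total_cost, result_rows, join_order) for each subset S
    best = {frozenset([t]): (0, sizes[t], (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in combinations(tables, k):
            S = frozenset(subset)
            candidates = []
            for t in subset:  # peel one table off (left-deep plans)
                cost_rest, rows_rest, order_rest = best[S - {t}]
                rows = out_size(rows_rest, sizes[t])
                candidates.append((cost_rest + rows, rows, order_rest + (t,)))
            best[S] = min(candidates)  # keep only the cheapest plan for S
    return best[frozenset(tables)]
```

For example, with sizes {"A": 1000, "B": 100, "C": 10} and a toy selectivity of 1/100 (out_size = lambda r1, r2: r1 * r2 // 100), the DP joins the two small tables first.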
This is a pretty long post already. This is the basic approach, and we can
improve on it further depending on the feedback I get.
I presented this work as a poster at ACM SIGMOD 2010 and got some positive
feedback.
> Cost Based Query optimization in Hive
> -------------------------------------
>
> Key: HIVE-1938
> URL: https://issues.apache.org/jira/browse/HIVE-1938
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Environment: *nix,java
> Reporter: bharath v
>
> Current optimization in Hive is just rule-based and involves applying a set
> of rules on the plan tree. This depends on hints given by the user (which may
> or may not be correct) and might result in execution of costlier plans. So
> this jira aims at building a cost model which can give a good estimate of
> various plans beforehand (using some meta-data already collected), so we can
> choose the plan which incurs the least cost.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira