Re: Query Optimization in Hive

bharath vissapragada Mon, 31 Jan 2011 21:50:53 -0800

Hi Ning, Anja,

I am doing my Master's thesis on this topic. I have implemented SQL
features such as joins and selects on top of Hadoop (before knowing
about Hive), and we have derived some basic cost models for join
re-ordering that seem to be working fine at some basic scale factors
of the TPC-H datasets. Later I came to know about Hive, and I am now
trying to implement the same in Hive.


Right now I am in the process of understanding Hive's source, and I am
almost done with the "ql" package. I think it would be great if you
could help us in this regard. I am a bit confused about the
implementation of joins; once I'm done with that, I can modify the
"joinReorder" of the Optimizer package by using the cost formulae and
metadata. It would be a great opportunity to work with you at FB
and contribute to Hive.

Thanks,
Bharath,V
4th year Undergrad, IIIT Hyderabad.
w: http://research.iiit.ac.in/~bharath.v

On Tue, Feb 1, 2011 at 9:22 AM, Ning Zhang <nzh...@fb.com> wrote:
> Hi Anja,
>
> As you noticed, Hive has only limited support for cost-based optimization 
> (CBO). One of the reasons is that Hive used to have a very small number of 
> alternative execution plans to choose from. One exception is mapjoin vs. 
> common join. Liying Tang did some work during his last internship to convert 
> common joins to mapjoins in a rule-based fashion. One item of his future work 
> is to automatically convert common joins to mapjoins based on stats. There is 
> also ongoing work on indexes in Hive. With index support, CBO will be much 
> needed.
>
> In order for a decent CBO to work, we need stats and cost models. There is 
> some work on stats. Table/partition-level stats are already supported. 
> There is a JIRA open for column-level stats (HIVE-1362). The cost model is 
> much more complex in a Hadoop environment and closely dependent on the 
> mapjoin/index implementations. With all of these in place, we can then talk 
> about plan enumeration etc.
>
> So yes, we are interested in CBO, but it is a large area and many missing 
> pieces need to be filled in for Hive. If you have a particular interest in 
> some area, you can propose your ideas on the hive-...@hive.apache.org mailing 
> list or even apply for an internship at FB if you would like to work closely 
> with us.
>
> Thanks,
> Ning
>
> On Jan 31, 2011, at 2:04 PM, Anja Gruenheid wrote:
>
>> Hi!
>>
>> I'm a graduate student from Georgia Tech and I'm working with Hive for a 
>> research project. I am interested in query optimization and the Hive 
>> MetaStore in that context. Working through the documentation and code, I 
>> noticed that the implementation right now is using a rule-based optimization 
>> system. Therefore, I was wondering whether cost-based query optimization 
>> will be a future task in the development of Hive and if it would be possible 
>> for me to cooperate with the developers of Hive to advance the project in 
>> general.
>>
>> Best regards,
>> Anja Gruenheid
>
>
