Bharath,

This would be great.
Why don't you write up something about how you are planning to proceed?
File a new JIRA and load some design notes/spec there. We can definitely
sync up from there. This feature would be very useful to the community.
We, at Facebook, would definitely like to use it.

Thanks,
-namit

On 1/31/11 9:50 PM, "bharath vissapragada" <bharathvissapragada1...@gmail.com> wrote:

>Hi Ning, Anja,
>
>I am doing my Masters thesis on this topic. I have implemented all
>SQL features like joins, selects, etc. on top of Hadoop (before knowing
>about Hive) and we have derived some basic cost models for join
>re-ordering which seem to be working fine on some basic scales of TPC-H
>datasets. Later I came to know about Hive and I am trying to
>implement the same in Hive.
>
>Right now I am in the process of understanding Hive's source and I am
>almost done with the "ql" package. I think it would be great if you guys
>can help us in this regard. I am a bit confused about the
>implementation of joins, and once I'm done with that, I can modify the
>"joinReorder" of the Optimizer package by using the cost formulae and
>metadata. It would be a great opportunity to work with you guys at FB
>and contribute to Hive.
>
>Thanks
>Bharath,V
>4th year Undergrad, IIIT Hyderabad.
>w: http://research.iiit.ac.in/~bharath.v
>
>On Tue, Feb 1, 2011 at 9:22 AM, Ning Zhang <nzh...@fb.com> wrote:
>> Hi Anja,
>>
>> As you noticed, Hive only has limited support for cost-based
>> optimization. One of the reasons is that Hive used to have a very small
>> number of optional execution plans to choose from. One exception is
>> mapjoin vs. common joins. Liying Tang did some work during his last
>> internship to convert common joins to mapjoins in a rule-based fashion.
>> One of his future work items is to automatically convert common joins
>> to mapjoins based on stats. There is also ongoing work on indexes in
>> Hive. With the support of indexes, CBO will be much needed.
>>
>> In order for a decent CBO to work, we need stats and cost models.
>> There is some work on stats. Table/partition-level stats are already
>> supported. There is a JIRA open for column-level stats (HIVE-1362). The
>> cost model is much more complex in a Hadoop environment and closely
>> dependent on the mapjoin/index implementations. Given all these in
>> place, we can then talk about plan enumeration etc.
>>
>> So yes, we are interested in CBO, but it is a large area and many
>> missing pieces need to be filled in Hive. If you have a particular
>> interest in some area, you can propose your ideas on the
>> hive-...@hive.apache.org mailing list or even apply for an internship
>> at FB if you would like to work closely with us.
>>
>> Thanks,
>> Ning
>>
>> On Jan 31, 2011, at 2:04 PM, Anja Gruenheid wrote:
>>
>>> Hi!
>>>
>>> I'm a graduate student from Georgia Tech and I'm working with Hive for
>>> a research project. I am interested in query optimization and the Hive
>>> MetaStore in that context. Working through the documentation and code,
>>> I noticed that the implementation right now is using a rule-based
>>> optimization system. Therefore, I was wondering whether cost-based
>>> query optimization will be a future task in the development of Hive and
>>> if it would be possible for me to cooperate with the developers of Hive
>>> to advance the project in general.
>>>
>>> Best regards,
>>> Anja Gruenheid