Hi,

I hope that this is the correct list. Apologies if not.
I am using Hive 0.11.0 and Hadoop 1.0.4. My goal is to get my Hive queries running without Map/Reduce, using my custom indexes instead. To this end I have been building Hive version 13 from source and working through the code to see what I can do.

I can see that the non-M/R path through Hive splits off very early. In SemanticAnalyzer.java, if Hive determines that a FetchTask is sufficient for the query, the genMapRedTasks method returns early and never gets near the code that uses indexes.

I have also followed the code through the index handling, and I can see that in IndexWhereProcessor.java an index can insert an "index query" task to run before the main query. (By also calling the queryContext setIndexInputFormat and setIndexIntermediateFile methods, it can redirect the main query to pick up the data generated by the index.)

So I can see two approaches to achieve my goal.

1) I can modify the FetchTask path to support the use of indexes.
2) I can allow the query to start down the Map/Reduce path and then arrange for my index code to discard the original query completely and replace it with a query that will run as a FetchTask and do what I want.

Of course there are pros and cons to both of these approaches.

Approach 1 has the advantage that I don't need to change the current index path at all, so it is much less likely that I will damage it. However, I will probably end up replicating some of the existing index code, which is not desirable. Also, I am not sufficiently au fait with the Hive code to feel confident that I would make such a major change in the way that a real Hive developer would.

Approach 2 has the advantage that I am building on top of the existing index infrastructure, so I will probably end up writing less code. However, it means that my queries will run once as Map/Reduce and again as FetchTasks, which will make them slower than I would like. The approach is also more complicated than I would like.
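To make the two options concrete, here is a toy model of how I understand the resulting plans would differ. It is only a sketch under my own assumptions: PlanSketch, the Task enum and the plan* methods are all invented for illustration and are not Hive classes or APIs, and the two boolean flags merely stand in for the decisions Hive makes in SemanticAnalyzer and in the index code.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: nothing here is a real Hive class or method.
public class PlanSketch {

    // Hypothetical task kinds, named after the roles described above.
    public enum Task { FETCH, FETCH_VIA_INDEX, INDEX_MR, MAIN_MR }

    // Behaviour as I read it today: a fetch-only query returns early,
    // so the index code is never consulted for it.
    public static List<Task> planCurrent(boolean fetchSufficient, boolean indexApplies) {
        if (fetchSufficient) {
            return Arrays.asList(Task.FETCH);
        }
        if (indexApplies) {
            // The index inserts its own query task ahead of the main query.
            return Arrays.asList(Task.INDEX_MR, Task.MAIN_MR);
        }
        return Arrays.asList(Task.MAIN_MR);
    }

    // Approach 1: the fetch path itself learns to use the index.
    public static List<Task> planApproach1(boolean fetchSufficient, boolean indexApplies) {
        if (indexApplies) {
            return Arrays.asList(Task.FETCH_VIA_INDEX);
        }
        return planCurrent(fetchSufficient, indexApplies);
    }

    // Approach 2: start down the Map/Reduce path, then let the index code
    // swap the main query for a fetch over the index output.
    public static List<Task> planApproach2(boolean fetchSufficient, boolean indexApplies) {
        if (indexApplies) {
            return Arrays.asList(Task.INDEX_MR, Task.FETCH);
        }
        return planCurrent(fetchSufficient, indexApplies);
    }

    public static void main(String[] args) {
        System.out.println(planCurrent(true, true));    // prints [FETCH]
        System.out.println(planApproach1(true, true));  // prints [FETCH_VIA_INDEX]
        System.out.println(planApproach2(false, true)); // prints [INDEX_MR, FETCH]
    }
}
```

In this model approach 1 yields a single task, while approach 2 still carries a Map/Reduce step, which is the slowdown I am worried about.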
With approach 2, I also don't really know how cleanly I can "abort" the initial query and replace it with a FetchTask (if, indeed, this is possible).

Obviously, at some point I would like my changes to be submitted back into the main Hive source, so I want to maximize the chances that they will be viewed positively.

Does anyone have any opinions or advice to offer?

Regards,

Peter Marron
Senior Developer
Trillium Software, A Harte Hanks Company
Theale Court, 1st Floor, 11-13 High Street
Theale RG7 5AH
+44 (0) 118 940 7609 office
+44 (0) 118 940 7699 fax
http://www.trilliumsoftware.com/