I'd like to revise the Indexing <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing> and IndexDev <https://cwiki.apache.org/confluence/display/Hive/IndexDev> docs in the wiki to include this information (as well as information from a previous thread, if I can find it) so people won't be misled into using indexes inappropriately.
But it might be more efficient for Gopal or another expert to do the revisions. Otherwise I would need careful reviews to make sure I don't garble things.

-- Lefty

On Tue, Jan 5, 2016 at 3:55 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

>
> >So in a nutshell in Hive if "external" indexes are not used for improving
> >query response, what value they add and can we forget them for now?
>
> The builtin indexes - those that write data as smaller tables are only
> useful in a pre-columnar world, where the indexes offer a huge reduction
> in IO.
>
> Part #1 of using hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
>
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
>
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
>
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index.
>
> https://issues.apache.org/jira/browse/HIVE-11676
>
>
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
>
> And that for a workhorse data warehouse, this has to survive even if
> there's a non-stop stream of updates into it.
>
> Cheers,
> Gopal