I am playing with Hive indexing and a little discouraged by the gap between the potential seen and the amount of documentation around indexing. I am running Hive 0.9 and started playing with indexing as follows:
I have a table logs that has a bunch of fields but for this, lets say three. sessionutc, srcip, dstip and partitioned by DAY. CREATE TABLE logs(sessionutc STRING, srcip STRING, dstip STRING) PARTITIONED by (day STRING) The field I am hoping to index is srcip, so I created this: CREATE INDEX idx_srcip ON TABLE logs(srcip) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD; ALTER INDEX idx_srcip ON pcaps REBUILD; This alter index ran overnight.... Also be warned, the data in my table is 522GB. The buidling of index took all night and wrote a ton of data (49 GB) to the hive history file in the CLI... that just aint right if you ask me. :) Can we limit that somehow? The actual index table ended up being around 1.8 GB on 522GB of data. That wasn't too bad I guess. Then time for queries... I thought Hive0.9 just supported indexes on queries, the old school DB guy just ran a query SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1' Started running 1028 Map Tasks... obviously not usiung an index, takes forever, hmmm.. So I google, and find this: INSERT OVERWRITE DIRECTORY "/tmp/index_result" SELECT `_bucketname` , `_offsets` FROM web__logs_idx_srcip__ WHERE srcip = '127.0.0.1'; SET hive.index.compact.file=/tmp/index_result; SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat; SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' 43 Map task, done quickly. Indexes WORK! WOOOOOOO. So Do Indexes work when you have a non-indexed field in your where clause. Hypothesis: If you "and" your indexed field, it will work (like partition pruning) SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' and sessionutc like '2012-04-04%' 43 Map Tasks: Quick Hits Index Hypothesis: True Hypothesis If you "or" your indexed field it will not work (like partition pruning) SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' or sessionutc like '2012-04-04%' 43 Map Tasks: Quick Hits Index Hypothesis: False What the heck? HOW DOES THAT WORK? That's impossible. The data returned is truly that data without the srcip = 127.0.0.1 It's from all partitions. I am so confused there. Ok: Other questions. How are indexed updated? How does INSERT OVERWRITE or INSERT APPEND affect currently built indexes? Are indexes rebuilt automatically or do you have to have separate job to rebuild the indexes on partitions that you've added data too? If index updating is automagic, does it slow INSERT times? Need to do some testing here. Also: Is there a better way to enable Index hitting automagically without having to do the separate index query and the two SET commands prior to my query I want to hit the index? I'd like to utilize the indices to help our operations staff write better queries (read use less cluster resources) but teaching that process may be difficult. Any other good non-googlable sources of information on indexes, The reason I posted my results and my questions here is the general lack of information around this topic.