I do not have answers to any of your questions, but I appreciate you raising 
them. My team is very interested in Hive indexing as well, so I look forward to 
this discussion.

Chuck Connell
Nuance R&D Data Team
Burlington, MA


From: John Omernik [mailto:j...@omernik.com]
Sent: Thursday, July 26, 2012 10:40 AM
To: user@hive.apache.org
Subject: Hive 0.9 and Indexing

I am playing with Hive indexing and a little discouraged by the gap between the 
potential seen and the amount of documentation around indexing. I am running 
Hive 0.9 and started playing with indexing as follows:

I have a table logs that has a bunch of fields but for this, lets say three. 
sessionutc, srcip, dstip and partitioned by DAY.

CREATE TABLE logs(sessionutc STRING, srcip STRING, dstip STRING)
PARTITIONED by (day STRING)

The field I am hoping to index is srcip, so I created this:

CREATE INDEX idx_srcip ON TABLE logs(srcip) as 
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED 
REBUILD;

ALTER INDEX idx_srcip ON pcaps REBUILD;

This alter index ran overnight.... Also be warned, the data in my table is 
522GB.  The buidling of index took all night and wrote a ton of data (49 GB) to 
the hive history file in the CLI... that just aint right if you ask me. :) Can 
we limit that somehow?

The actual index table ended up being around 1.8 GB on 522GB of data.  That 
wasn't too bad I guess.

Then time for queries... I thought Hive0.9 just supported indexes on queries, 
the old school DB guy just ran a query

SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1'

Started running 1028 Map Tasks... obviously not usiung an index, takes forever, 
hmmm..

So I google, and find this:

INSERT OVERWRITE DIRECTORY "/tmp/index_result" SELECT `_bucketname` ,  
`_offsets` FROM web__logs_idx_srcip__ WHERE srcip = '127.0.0.1';
SET hive.index.compact.file=/tmp/index_result;
SET 
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1'

43 Map task, done quickly. Indexes WORK! WOOOOOOO.

So Do Indexes work when you have a non-indexed field in your where clause.

Hypothesis: If you "and" your indexed field, it will work (like partition 
pruning)

SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' and 
sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: True

Hypothesis If you "or" your indexed field it will not work (like partition 
pruning)
SELECT sessionutc, srcip, dst IP FROM logs WHERE srcip='127.0.0.1' or 
sessionutc like '2012-04-04%'
43 Map Tasks: Quick Hits Index Hypothesis: False

What the heck? HOW DOES THAT WORK? That's impossible. The data returned is 
truly that data without the srcip = 127.0.0.1 It's from all partitions.  I am 
so confused there.

Ok: Other questions.

How are indexed updated?  How does INSERT OVERWRITE or INSERT APPEND affect 
currently built indexes? Are indexes rebuilt automatically or do you have to 
have separate job to rebuild the indexes on partitions that you've added data 
too? If index updating is automagic, does it slow INSERT times? Need to do some 
testing here.

Also: Is there a better way to enable Index hitting automagically without 
having to do the separate index query and the two SET commands prior to my 
query I want to hit the index? I'd like to utilize the indices to help our 
operations staff write better queries (read use less cluster resources) but 
teaching that process may be difficult.

Any other good non-googlable sources of information on indexes, The reason I 
posted my results and my questions here is the general lack of information 
around this topic.






Reply via email to