Re: Indexing in hive

Mark Grover Wed, 16 May 2012 19:28:22 -0700

Hi Ranjith,
Here are the steps for using an index in Hive 0.7.1.

1) Create the index 
CREATE INDEX x ON TABLE t(j) 
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD;


2) Build the index (since you specified the 'DEFERRED REBUILD' flag in the 
create index statement

ALTER INDEX x ON t REBUILD;

3) Use the index
If your original query is: select a, count(*) from t where j='and' group by a;
and you wanted to use the index on column j, do something like:

INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` 
FROM default__t_x__ where j='and';
(The name default__t_x__ can be found in the output of step 2. Also, 
/tmp/indexes directory needs to exist in HDFS. You can substitute this to be 
any pre-existing directory in HDFS)
SET hive.index.compact.file=/tmp/indexes/x;
SET 
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT a, count(*) from t where j='and' group by a;

Since the semantics of this usage make you specify the compact file, I have not 
been able to figure out a way to use multiple indexes in the same query. In 
this case we are using the index on j, the column in the where clause.

I hope you now understand why indexing in Hive is a work in progress:-)

Good luck!
Mark


Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

----- Original Message -----
From: "Ranjith Raghunath" <ranjith.raghuna...@usaa.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Cc: "terry.zhao...@huawei.com" <terry.zhao...@huawei.com>
Sent: Wednesday, May 16, 2012 9:46:23 PM
Subject: Re: Indexing in hive

Thanks Mark, Carl, and Ransom. I really appreciate the answers here. I am using 
Hive 0.7.1 and currently trying to create an index to help with performance 
associated to a particular where clause. I have not set any properties as 
mentioned below. I can try the options you listed below. Based on the feedback, 
I do have a few questions:

1. It seems like if you optimize for a groupby you cannot optimize for the 
where clause. I am reading this correctly?

2. How do you build the index? 

3. Does the build process create a dataset that contains the index keys along 
with an offset value associated with the row? 

4. And I am guessing that you need to use the columns in the same order as it 
is defined in the index?


Thanks,
Ranjith      

----- Original Message -----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Wednesday, May 16, 2012 07:52 PM
To: user@hive.apache.org <user@hive.apache.org>
Cc: Zhaojun (Terry) <terry.zhao...@huawei.com>
Subject: Re: Indexing in hive

Ransom,
>From this JIRA (https://issues.apache.org/jira/browse/HIVE-1644), it looks 
>like automatic use of indexes using hive.optimize.index.filter was introduced 
>in Hive 0.8. However, Ranjith seems to be using Hive 0.7.1 which doesn't 
>support those properties.

Ranjith, you need to set the appropriate priorities before calling your query 
to make use of indexes. Are you setting any properties? If so, what?
I will try to dig up what those properties are, in the meanwhile.

Mark

----- Original Message -----
From: "Hezhiqiang (Ransom)" <ransom.hezhiqi...@huawei.com>
To: user@hive.apache.org
Cc: "Zhaojun (Terry)" <terry.zhao...@huawei.com>
Sent: Wednesday, May 16, 2012 8:32:55 PM
Subject: RE: Indexing in hive




“ hive.optimize.index.filter ” is the conf automatically use indexes 

If u set hive.optimize.index.groupby = true. 

It will set hive.optimize.index.filter =false. 

See your configurations. 



And you need to build index after create index. 




Best regards 

Ransom.

Re: Indexing in hive

Reply via email to