[jira] [Commented] (HIVE-2535) Use sorted nature of compact indexes

[email protected] (Commented) (JIRA) Tue, 01 Nov 2011 11:07:59 -0700

    [ https://issues.apache.org/jira/browse/HIVE-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141408#comment-13141408 ]

[email protected] commented on HIVE-2535:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2987
-----------------------------------------------------------

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java
<https://reviews.apache.org/r/2605/#comment6699>

    Do you need to override these functions?
    They should be the same as in HiveRR.

- namit

On 2011-10-29 01:39:50, Kevin Wilfong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2605/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-10-29 01:39:50)
bq.  
bq.  
bq.  Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  The CompactIndexHandler determines whether the reentrant query it creates is a 
candidate for exploiting the fact that the index is sorted (it has an appropriate 
number of non-partition conditions, and the query plan is of the expected form).  
If so, it sets the input format to HiveSortedInputFormat and marks the 
FilterOperator for the non-partition condition.
bq.  
bq.  The HiveSortedInputFormat extends HiveInputFormat, so its splits 
consist of data from a single file, and its record reader is 
HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by assuming 
it is performing a binary search.  It sets the appropriate flags in IOContext, 
which acts as the means of communication between the FilterOperators and the 
record reader.  The non-partition FilterOperator is responsible for executing a 
comparison between the value in the row and column of interest and the 
constant.  It also provides the type of the generic UDF.  It sets this data in 
the IOContext.  As long as the binary search continues the FilterOperators do 
not forward rows to the operators below them.  The record reader uses the 
comparison and the type of the generic UDF to execute a binary search on the 
underlying RCFile until it finds the block of interest, or determines that, if 
any block is of interest, it is the last one.  The search then proceeds linearly 
from the beginning of the identified block.  If a problem ever occurs during the 
binary search, such as the comparison failing for some reason, a linear search 
begins from the beginning of the data which has yet to be eliminated.
bq.  
bq.  Regardless of whether or not a binary search is performed, the record 
reader attempts to end the linear search as soon as it can based on the 
comparison and the type of the generic UDF.
bq.  
bq.  
bq.  This addresses bug HIVE-2535.
bq.      https://issues.apache.org/jira/browse/HIVE-2535
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
bq.    trunk/conf/hive-default.xml 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
bq.    trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 
bq.  
bq.  Diff: https://reviews.apache.org/r/2605/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  I added a test to verify the functionality of the 
HiveBinarySearchRecordReader.
bq.  
bq.  I also added a .q file to test that this returns the correct results when 
the underlying index is stored in an RCFile and when it is stored as a text 
file, with all of the supported operators.
bq.  
bq.  I ran the .q files to verify they still pass.
bq.  
bq.  I ran some queries to verify there was a CPU benefit to doing this.  I saw 
as much as a 45% reduction in the total CPU used by the map reduce job to scan 
the index, for a large data set. 
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Kevin
bq.  
bq.
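
The search strategy described in the quoted summary (binary search to a block, 
then a linear scan from the start of that block) can be sketched roughly as 
follows. This is only an illustration, not code from the patch: the class 
SortedBlockSearchSketch, its methods, and the representation of blocks as 
non-empty int arrays sorted in ascending order are all assumptions made for the 
example, and none of the Hive classes (IOContext, FilterOperator, RCFile) are 
used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the HIVE-2535 patch. */
class SortedBlockSearchSketch {

    /**
     * Binary search over blocks whose keys are sorted ascending across the
     * whole file. Returns the index of the last block whose first key is
     * strictly less than the target, or 0 if there is no such block.
     * Using "strictly less" keeps duplicates of the target that spill over
     * from the end of one block into the next.
     * Assumes every block is non-empty.
     */
    static int findStartBlock(List<int[]> blocks, int target) {
        int lo = 0;
        int hi = blocks.size() - 1;
        int start = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (blocks.get(mid)[0] < target) {
                start = mid;      // matching rows cannot begin before this block
                lo = mid + 1;
            } else {
                hi = mid - 1;     // matching rows may begin in an earlier block
            }
        }
        return start;
    }

    /**
     * Linear scan from the beginning of the identified block, collecting
     * keys equal to the target and stopping at the first larger key.
     */
    static List<Integer> findEqual(List<int[]> blocks, int target) {
        List<Integer> matches = new ArrayList<>();
        for (int b = findStartBlock(blocks, target); b < blocks.size(); b++) {
            for (int key : blocks.get(b)) {
                if (key > target) {
                    return matches; // sorted input: no later key can match
                }
                if (key == target) {
                    matches.add(key);
                }
            }
        }
        return matches;
    }
}
```

In this sketch the binary search only compares the first key of each block, so 
the cost of locating the start block is logarithmic in the number of blocks, 
and the linear scan touches only the rows from that block onward.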

> Use sorted nature of compact indexes
> ------------------------------------
>
>                 Key: HIVE-2535
>                 URL: https://issues.apache.org/jira/browse/HIVE-2535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2535.1.patch.txt
>
>
> Compact indexes are sorted based on the indexed columns, but we are not using 
> this fact when we access the index.
> To start with, if the index is stored as an RC file, and if the predicate 
> being used to access the index consists of only one non-partition condition 
> using one of the operators >, >=, <, <=, or =, we could use a binary search (if 
> necessary) to find the block to begin scanning for unfiltered rows, and we 
> could use the result of comparing the value in the column with the constant 
> (this is necessarily the form of a predicate which is optimized using an 
> index) to determine when we have found all the rows which will be unfiltered.
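
The last point of the description above, using the comparison result to decide 
when all unfiltered rows have been found, might look something like the sketch 
below. It assumes the indexed column is sorted in ascending order and that the 
comparison is expressed as the sign of compare(rowValue, constant). The class 
EarlyStopSketch, the Op enum, and canStopScan are hypothetical names chosen for 
the example and do not appear in the patch.

```java
/** Hypothetical sketch; these names are not taken from the HIVE-2535 patch. */
class EarlyStopSketch {

    /** The five comparison operators named in the issue description. */
    enum Op { LT, LE, EQ, GE, GT }

    /**
     * Decides whether a linear scan over ascending data can stop.
     *
     * @param op  operator in the predicate "column op constant"
     * @param cmp sign of compare(rowValue, constant): negative, zero, or positive
     * @return true if no later row can satisfy the predicate
     */
    static boolean canStopScan(Op op, int cmp) {
        switch (op) {
            case LT:
                return cmp >= 0; // rows from here on are >= the constant
            case LE:
            case EQ:
                return cmp > 0;  // rows from here on are > the constant
            default:
                return false;    // GE, GT: every later row still matches
        }
    }
}
```

Under these assumptions, the > and >= operators gain nothing from early 
termination and benefit only from the binary search that skips the leading 
rows, while <, <=, and = can cut the scan short.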

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
