Re: Review Request: Use sorted nature of compact indexes

Kevin Wilfong Tue, 01 Nov 2011 11:17:33 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2988
-----------------------------------------------------------




trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6700>

    findNonPartitionFilter modifies the conf of the FilterOperator for the
    index column to mark it as the one on which the data is sorted.
    
    See line 239



trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6703>

    The boolean useSorted, used at line 186 in the if statement surrounding
    the call to this method, is set to true iff exactly one index column is
    being filtered on; see line 277.


- Kevin


On 2011-10-29 01:39:50, Kevin Wilfong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/2605/
> -----------------------------------------------------------
> 
> (Updated 2011-10-29 01:39:50)
> 
> 
> Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
> 
> 
> Summary
> -------
> 
> The CompactIndexHandler determines whether the reentrant query it creates is 
> a candidate for exploiting the fact that the index is sorted (it has an 
> appropriate number of non-partition conditions, and the query plan is of the 
> form expected).  If so, it sets the input format to HiveSortedInputFormat and 
> marks the FilterOperator for the non-partition condition.
> 
> HiveSortedInputFormat extends HiveInputFormat, so its splits consist 
> of data from a single file, and its record reader is 
> HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by 
> assuming it is performing a binary search.  It sets the appropriate flags in 
> IOContext, which acts as the means of communication between the 
> FilterOperators and the record reader.  The non-partition FilterOperator is 
> responsible for executing a comparison between the value in the row and 
> column of interest and the constant.  It also provides the type of the 
> generic UDF.  It sets this data in the IOContext.  As long as the binary 
> search continues the FilterOperators do not forward rows to the operators 
> below them.  The record reader uses the comparison and the type of the 
> generic UDF to execute a binary search on the underlying RCFile until it 
> finds the block of interest, or determines that if any block is of interest 
> it is the last one.  The search then proceeds linearly from the beginning of 
> the identified block.  If a problem ever occurs during the binary search, 
> for example the comparison fails for some reason, a linear search begins 
> from the start of the data that has not yet been eliminated.
> 
> Regardless of whether or not a binary search is performed, the record reader 
> attempts to end the linear search as soon as it can based on the comparison 
> and the type of the generic UDF.
> 
> 
> This addresses bug HIVE-2535.
>     https://issues.apache.org/jira/browse/HIVE-2535
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507
>   trunk/conf/hive-default.xml 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION
>   trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION
>   trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/2605/diff
> 
> 
> Testing
> -------
> 
> I added a test to verify the functionality of the 
> HiveBinarySearchRecordReader.
> 
> I also added a .q file to test that this returns the correct results when the 
> underlying index is stored in an RCFile and when it is stored as a text 
> file, with all of the supported operators.
> 
> I ran the .q files to verify they still pass.
> 
> I ran some queries to verify there was a CPU benefit to doing this.  I saw as 
> much as a 45% reduction in the total CPU used by the map reduce job to scan 
> the index, for a large data set. 
> 
> 
> Thanks,
> 
> Kevin
> 
>
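
For readers following the summary quoted above, here is a minimal sketch of
the search strategy it describes: a binary search over sorted blocks to pick
a starting block, followed by a linear scan that stops as soon as the
comparison shows no later row can match. It is not the code from the patch.
The class SortedBlockSearchSketch, the enum Op, and the methods
findStartBlock and scan are hypothetical names, plain int arrays stand in
for RCFile blocks, and the fallback to a full linear search on comparison
failure is not modeled. It assumes every block is non-empty and keys are
sorted in ascending order across blocks.

```java
import java.util.ArrayList;
import java.util.List;

public class SortedBlockSearchSketch {

    /** Hypothetical stand-in for the type of the comparison in the filter. */
    enum Op { EQUAL, LESS_THAN, GREATER_THAN }

    /**
     * Binary search on the first key of each block. Returns the index of the
     * last block whose first key is strictly less than the constant, or 0 if
     * there is no such block. Every block before it holds only smaller keys,
     * so a scan for EQUAL or GREATER_THAN can safely begin there.
     */
    static int findStartBlock(int[][] blocks, int constant) {
        int lo = 0;
        int hi = blocks.length - 1;
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // upper midpoint, so lo = mid makes progress
            if (blocks[mid][0] < constant) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    /** Linear scan from the chosen block, ending early when nothing later can match. */
    static List<Integer> scan(int[][] blocks, Op op, int constant) {
        List<Integer> out = new ArrayList<>();
        if (blocks.length == 0) {
            return out;
        }
        // A LESS_THAN filter matches a prefix of the data, so it starts at block 0.
        int start = (op == Op.LESS_THAN) ? 0 : findStartBlock(blocks, constant);
        for (int b = start; b < blocks.length; b++) {
            for (int key : blocks[b]) {
                int cmp = Integer.compare(key, constant);
                // Early termination: the data is sorted, so later keys only grow.
                if (op == Op.EQUAL && cmp > 0) {
                    return out;
                }
                if (op == Op.LESS_THAN && cmp >= 0) {
                    return out;
                }
                boolean match = (op == Op.EQUAL && cmp == 0)
                        || (op == Op.LESS_THAN && cmp < 0)
                        || (op == Op.GREATER_THAN && cmp > 0);
                if (match) {
                    out.add(key);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] blocks = {{1, 2, 3}, {3, 3, 5}, {5, 7, 9}};
        System.out.println(scan(blocks, Op.EQUAL, 5));
        System.out.println(scan(blocks, Op.GREATER_THAN, 5));
        System.out.println(scan(blocks, Op.LESS_THAN, 3));
    }
}
```

Starting at the last block whose first key is below the constant, rather
than the first block whose first key equals it, keeps the sketch correct
when equal keys straddle a block boundary.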
