Re: Review Request: Use sorted nature of compact indexes

namit jain Tue, 01 Nov 2011 11:00:38 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2974
-----------------------------------------------------------




trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/2605/#comment6684>

    The default can be true



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6692>

    nit: spelling 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6685>

    More comments here.
    It would be useful to describe when a binary search is
    performed.



trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6694>

    This should not be hard-coded.
    If the user wanted HiveInputFormat, it should be 
    HiveSortedInputFormat, and likewise CombineHiveSortedInputFormat
    for CombineHiveInputFormat.
    
    Do we need a new class, or can sorted be a 
    property of the input format? Then it should automatically
    work for both HiveInputFormat and CombineHiveInputFormat.
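
One way to read the suggestion above is to make sortedness a flag on a shared base class rather than a separate sorted subclass. The following is only a minimal sketch under that assumption: `SortedPropertySketch`, `BaseFormat`, `PlainFormat`, `CombineFormat`, and the `example.input.format.sorted` key are hypothetical stand-ins, not Hive classes or HiveConf properties.

```java
import java.util.Map;

// Hypothetical sketch: none of these types are real Hive classes.
class SortedPropertySketch {
    // Assumed property name, used only for illustration.
    static final String SORTED_KEY = "example.input.format.sorted";

    /** Shared base: reads the flag in one place, so every subclass inherits it. */
    abstract static class BaseFormat {
        private boolean sorted;

        void configure(Map<String, String> conf) {
            sorted = Boolean.parseBoolean(conf.getOrDefault(SORTED_KEY, "false"));
        }

        /** Picks a reader strategy from the flag instead of from the class name. */
        String readerKind() {
            return sorted ? "binary-search" : "linear";
        }
    }

    /** Stand-in for a format whose splits come from a single file. */
    static class PlainFormat extends BaseFormat { }

    /** Stand-in for a format that combines several files into one split. */
    static class CombineFormat extends BaseFormat { }
}
```

With the flag set in the configuration map, both stand-in formats report the same reader strategy without any sorted subclass. The sketch covers only the configuration side; whether a binary search is valid over a split that combines several files is a separate question it does not address.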



trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6695>

    use the term index column instead of non-partition column.
    
    Who is using the function findNonPartitionFilterWork.
    It is not modifying any internal structure, and the 
    return value is not used
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6696>

    I am confused - what if the filter contains multiple
    non-partition column predicates?



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java
<https://reviews.apache.org/r/2605/#comment6698>

    As mentioned before, it would be good if this also works with CombineHiveIF


- namit


On 2011-10-29 01:39:50, Kevin Wilfong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/2605/
> -----------------------------------------------------------
> 
> (Updated 2011-10-29 01:39:50)
> 
> 
> Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
> 
> 
> Summary
> -------
> 
> The CompactIndexHandler determines if the reentrant query it creates is a 
> candidate for using the fact that the index is sorted (it has an appropriate 
> number of non-partition conditions, and the query plan is of the form 
> expected).  It sets the input format to HiveSortedInputFormat, and marks the 
> FilterOperator for the non-partition condition.
> 
> The HiveSortedInputFormat extends HiveInputFormat, so its splits consist 
> of data from a single file, and its record reader is 
> HiveBinarySearchRecordReader.  HiveBinarySearchRecordReader starts by 
> assuming it is performing a binary search.  It sets the appropriate flags in 
> IOContext, which acts as the means of communication between the 
> FilterOperators and the record reader.  The non-partition FilterOperator is 
> responsible for executing a comparison between the value in the row and 
> column of interest and the constant.  It also provides the type of the 
> generic UDF.  It sets this data in the IOContext.  As long as the binary 
> search continues the FilterOperators do not forward rows to the operators 
> below them.  The record reader uses the comparison and the type of the 
> generic UDF to execute a binary search on the underlying RCFile until it 
> finds the block of interest, or determines that the last block is the only 
> one that could be of interest.  The search then proceeds linearly from the 
> beginning of the identified block.  If a problem occurs during the binary 
> search, for example the comparison fails for some reason, a linear search 
> begins from the beginning of the data that has not yet been eliminated.
> 
> Regardless of whether or not a binary search is performed, the record reader 
> attempts to end the linear search as soon as it can based on the comparison 
> and the type of the generic UDF.
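
The search strategy described in the summary above can be sketched roughly as follows. This is an illustration only, not the HiveBinarySearchRecordReader code: `SortedBlockSearch`, its methods, and the use of `int` keys in in-memory arrays in place of RCFile blocks are all assumptions, every block is assumed to be non-empty, and only an equality predicate is handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: blocks of sorted int keys stand in for RCFile blocks.
class SortedBlockSearch {

    /**
     * Binary search over the first key of each block. Returns the index of the
     * last block whose first key is strictly less than target, or 0 if there is
     * no such block. Every key equal to target lies at or after that block.
     */
    static int findStartBlock(int[][] blocks, int target) {
        int lo = 0;
        int hi = blocks.length - 1;
        int start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (blocks[mid][0] < target) {
                start = mid;      // matches cannot begin before this block
                lo = mid + 1;
            } else {
                hi = mid - 1;     // this block starts at or past the target
            }
        }
        return start;
    }

    /** Linear scan from the identified block, stopping once keys exceed target. */
    static List<Integer> scanEquals(int[][] blocks, int target) {
        List<Integer> matches = new ArrayList<>();
        for (int b = findStartBlock(blocks, target); b < blocks.length; b++) {
            for (int key : blocks[b]) {
                if (key > target) {
                    return matches; // data is sorted, so nothing later can match
                }
                if (key == target) {
                    matches.add(key);
                }
            }
        }
        return matches;
    }
}
```

The binary search stops at the last block that starts strictly before the target, rather than at the first block containing it, because equal keys may straddle a block boundary. The early return in `scanEquals` corresponds to ending the linear search as soon as the comparison shows no further rows can match.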
> 
> 
> This addresses bug HIVE-2535.
>     https://issues.apache.org/jira/browse/HIVE-2535
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507 
>   trunk/conf/hive-default.xml 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507 
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION 
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/2605/diff
> 
> 
> Testing
> -------
> 
> I added a test to verify the functionality of the 
> HiveBinarySearchRecordReader.
> 
> I also added a .q file to test that this returns the correct results when the 
> underlying index is stored in an RCFile and when it is stored as a text 
> file, with all of the supported operators.
> 
> I ran the .q files to verify they still pass.
> 
> I ran some queries to verify there was a CPU benefit to doing this.  I saw as 
> much as a 45% reduction in the total CPU used by the map reduce job to scan 
> the index, for a large data set. 
> 
> 
> Thanks,
> 
> Kevin
> 
>
