-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/#review2974
-----------------------------------------------------------

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/2605/#comment6684>

    The default can be true.


trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6692>

    nit: spelling


trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
<https://reviews.apache.org/r/2605/#comment6685>

    Please add more comments here. It would be useful to describe when a
    binary search is performed.


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6694>

    This should not be hard-coded. If the user wanted HiveInputFormat, it
    should be HiveSortedInputFormat, and likewise
    CombineHiveSortedInputFormat for CombineHiveInputFormat.

    Do we need a new class, or can sorted be a property of the input format?
    Then it should automatically work for both HiveInputFormat and
    CombineHiveInputFormat.


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6695>

    Use the term "index column" instead of "non-partition column".

    Who is using the function findNonPartitionFilterWork? It does not modify
    any internal structure, and its return value is not used.


trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java
<https://reviews.apache.org/r/2605/#comment6696>

    I am confused: what if the filter contains multiple non-partition column
    predicates?


trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java
<https://reviews.apache.org/r/2605/#comment6698>

    As mentioned before, it would be good if this also worked with
    CombineHiveInputFormat.


- namit


On 2011-10-29 01:39:50, Kevin Wilfong wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail.
> To reply, visit:
> https://reviews.apache.org/r/2605/
> -----------------------------------------------------------
>
> (Updated 2011-10-29 01:39:50)
>
>
> Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
>
>
> Summary
> -------
>
> The CompactIndexHandler determines if the reentrant query it creates is a
> candidate for using the fact that the index is sorted (it has an appropriate
> number of non-partition conditions, and the query plan is of the form
> expected). If so, it sets the input format to HiveSortedInputFormat, and
> marks the FilterOperator for the non-partition condition.
>
> HiveSortedInputFormat extends HiveInputFormat, so its splits consist of
> data from a single file, and its record reader is
> HiveBinarySearchRecordReader. HiveBinarySearchRecordReader starts by
> assuming it is performing a binary search. It sets the appropriate flags in
> IOContext, which acts as the means of communication between the
> FilterOperators and the record reader. The non-partition FilterOperator is
> responsible for executing a comparison between the value in the row and
> column of interest and the constant. It also provides the type of the
> generic UDF. It sets this data in the IOContext. As long as the binary
> search continues, the FilterOperators do not forward rows to the operators
> below them. The record reader uses the comparison and the type of the
> generic UDF to execute a binary search on the underlying RCFile until it
> finds the block of interest, or determines that if any block is of interest
> it is the last one. The search then proceeds linearly from the beginning of
> the identified block. If a problem ever occurs during the binary search,
> for example the comparison fails for some reason, a linear search begins
> from the beginning of the data that has yet to be eliminated.
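
For readers following along, here is a minimal sketch of the search strategy
described in the summary above: a binary search over sorted blocks to pick a
starting block, followed by a linear scan from the beginning of that block.
It is not the code from the patch. The class SortedBlockSearch, its methods,
and the use of int arrays to stand in for RCFile blocks are all hypothetical,
and it assumes every block is non-empty, the data is sorted in ascending
order, and the filter is an equality comparison. The fallback to a plain
linear search when a comparison fails is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive.
class SortedBlockSearch {

    /**
     * Binary search over blocks: returns the index of the last block whose
     * first value is strictly less than target, or 0 if there is none.
     * Matches can begin in the tail of that block, so the linear scan must
     * start at its beginning. Assumes each block is non-empty and sorted,
     * and that blocks are sorted relative to each other.
     */
    static int findStartBlock(int[][] blocks, int target) {
        int lo = 0;
        int hi = blocks.length - 1;
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // upper middle, so lo always advances
            if (blocks[mid][0] < target) {
                lo = mid;      // everything before block mid is eliminated
            } else {
                hi = mid - 1;  // block mid starts at or after target; look earlier
            }
        }
        return lo;
    }

    /**
     * Linear scan from the start of the identified block, ending as soon as
     * a value greater than target shows that no later row can match.
     */
    static List<Integer> search(int[][] blocks, int target) {
        List<Integer> matches = new ArrayList<>();
        for (int b = findStartBlock(blocks, target); b < blocks.length; b++) {
            for (int value : blocks[b]) {
                if (value > target) {
                    return matches; // sorted input: stop the linear search early
                }
                if (value == target) {
                    matches.add(value);
                }
            }
        }
        return matches;
    }
}
```

With blocks {{1, 2}, {3, 3}, {3, 4}, {5, 6}} and target 3, findStartBlock
picks block 0, because block 1 already starts at 3 and a match could in
principle sit at the end of block 0. The scan then collects the three 3s and
stops at 4 without reading the last block.
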
>
> Regardless of whether or not a binary search is performed, the record reader
> attempts to end the linear search as soon as it can based on the comparison
> and the type of the generic UDF.
>
>
> This addresses bug HIVE-2535.
>     https://issues.apache.org/jira/browse/HIVE-2535
>
>
> Diffs
> -----
>
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507
>   trunk/conf/hive-default.xml 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSortedInputFormat.java PRE-CREATION
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION
>   trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION
>   trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION
>
> Diff: https://reviews.apache.org/r/2605/diff
>
>
> Testing
> -------
>
> I added a test to verify the functionality of the
> HiveBinarySearchRecordReader.
>
> I also added a .q file to test that this returns the correct results when the
> underlying index is stored in an RCFile and when it is stored as a text
> file, with all of the supported operators.
>
> I ran the .q files to verify they still pass.
>
> I ran some queries to verify there was a CPU benefit to doing this. I saw as
> much as a 45% reduction in the total CPU used by the map reduce job to scan
> the index, for a large data set.
>
>
> Thanks,
>
> Kevin
>
>