----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review558 -----------------------------------------------------------
trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java <https://reviews.apache.org/r/633/#comment1178> The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number. Do you mean this number is actually the seed to generate samples? trunk/conf/hive-default.xml <https://reviews.apache.org/r/633/#comment1179> same as above. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1195> better add a comment for this function explain what the rationale behind sampling splits. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1192> we should declare retLists as interface (List) rather than implementation (ArrayList) trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1193> same here, should declare it as Map rather than HashMap. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1194> can you add comments here? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1182> can you add a comment on what this Map is used for and what are the key and value of the Map? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1184> I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1189> you may want to check the percentage number (if it is a valid double and within the range [0,100]) and throw SemanticException if it is invalid before creating a SplitSample object. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1190> why limit cannot be combined with block sampling? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java <https://reviews.apache.org/r/633/#comment1191> This comment doesn't belong to this class. - Ning On 2011-04-20 18:28:29, Siying Dong wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/633/ > ----------------------------------------------------------- > > (Updated 2011-04-20 18:28:29) > > > Review request for hive, Ning Zhang and namit jain. > > > Summary > ------- > > We need a better input sampling to serve at least two purposes: > 1. test their queries against a smaller data set > 2. understand more about how the data look like without scanning the whole > table. > A simple function that gives a subset splits will help in those cases. It > doesn't have to be strict sampling. > > This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples > input splits with size at least n% of the original inputs. > > > This addresses bug HIVE-2121. > https://issues.apache.org/jira/browse/HIVE-2121 > > > Diffs > ----- > > trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 > trunk/conf/hive-default.xml 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java > 1095244 > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java > PRE-CREATION > trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 > trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q > PRE-CREATION > trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q > PRE-CREATION > trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION > trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out > PRE-CREATION > trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out > PRE-CREATION > trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 > trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 > trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 > trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 > trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 > > Diff: https://reviews.apache.org/r/633/diff > > > Testing > ------- > > TestCliDriver TestNegativeCliDriver, manual tests on real clusters. > > > Thanks, > > Siying > >
