[ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025107#comment-13025107 ]
jirapos...@reviews.apache.org commented on HIVE-2121: ----------------------------------------------------- ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review558 ----------------------------------------------------------- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java <https://reviews.apache.org/r/633/#comment1178> The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number. Do you mean this number is actually the seed to generate samples? trunk/conf/hive-default.xml <https://reviews.apache.org/r/633/#comment1179> same as above. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1195> better add a comment for this function explain what the rationale behind sampling splits. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1192> we should declare retLists as interface (List) rather than implementation (ArrayList) trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1193> same here, should declare it as Map rather than HashMap. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java <https://reviews.apache.org/r/633/#comment1194> can you add comments here? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1182> can you add a comment on what this Map is used for and what are the key and value of the Map? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1184> I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1189> you may want to check the percentage number (if it is a valid double and within the range [0,100]) and throw SemanticException if it is invalid before creating a SplitSample object. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java <https://reviews.apache.org/r/633/#comment1190> why limit cannot be combined with block sampling? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java <https://reviews.apache.org/r/633/#comment1191> This comment doesn't belong to this class. - Ning On 2011-04-20 18:28:29, Siying Dong wrote: bq. bq. ----------------------------------------------------------- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/633/ bq. ----------------------------------------------------------- bq. bq. (Updated 2011-04-20 18:28:29) bq. bq. bq. Review request for hive, Ning Zhang and namit jain. bq. bq. bq. Summary bq. ------- bq. bq. We need a better input sampling to serve at least two purposes: bq. 1. test their queries against a smaller data set bq. 2. understand more about how the data look like without scanning the whole table. bq. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. bq. bq. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. bq. bq. bq. This addresses bug HIVE-2121. bq. https://issues.apache.org/jira/browse/HIVE-2121 bq. bq. bq. Diffs bq. ----- bq. bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 bq. trunk/conf/hive-default.xml 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 bq. bq. Diff: https://reviews.apache.org/r/633/diff bq. bq. bq. Testing bq. ------- bq. bq. TestCliDriver TestNegativeCliDriver, manual tests on real clusters. bq. bq. bq. Thanks, bq. bq. Siying bq. bq. > Input Sampling By Splits > ------------------------ > > Key: HIVE-2121 > URL: https://issues.apache.org/jira/browse/HIVE-2121 > Project: Hive > Issue Type: New Feature > Reporter: Siying Dong > Assignee: Siying Dong > Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch > > > We need a better input sampling to serve at least two purposes: > 1. test their queries against a smaller data set > 2. understand more about how the data look like without scanning the whole > table. > A simple function that gives a subset splits will help in those cases. It > doesn't have to be strict sampling. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira