[ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025465#comment-13025465 ]
jirapos...@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq. >
bq. > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

I think if we clearly specify the semantics of block-level sampling in the wiki/documentation, there shouldn't be much confusion. In fact, I think it is much easier to explain than bucket-based sampling. In addition, if users are confused about the semantics, throwing a SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper for this feature, so that we can disable it quickly if we find a bug in it. That can already be achieved by switching branches. If we add a gatekeeper parameter for each feature, the conf will quickly grow unnecessarily large.


bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq. >
bq. > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

I think combining these two still makes sense:
1) As you mentioned, block sampling does not limit the # of rows, but limit does. Combining the two allows users to get approximately N rows quickly.
2) This restriction makes an exception in terms of the query language composition.
From the language syntax, it is allowed and makes sense to combine block sampling and limit, but the user will get a SemanticException if they do. I think a SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in a TODO and file a JIRA for it.


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/633/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-04-26 21:19:18)
bq.
bq.
bq. Review request for hive, Ning Zhang and namit jain.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. We need a better input sampling to serve at least two purposes:
bq. 1. test their queries against a smaller data set
bq. 2. understand more about how the data look like without scanning the whole table.
bq. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.
bq.
bq. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.
bq.
bq. This addresses bug HIVE-2121.
bq. https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
bq. trunk/conf/hive-default.xml 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
bq.
bq. Diff: https://reviews.apache.org/r/633/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.
bq.
bq. Thanks,
bq.
bq. Siying
bq.
bq.



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole table.
> A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
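
Editorial illustration (not part of the patch under review): the "at least n% of the original inputs" guarantee discussed in this thread can be sketched as below. This is a minimal sketch under stated assumptions, not the logic in CombineHiveInputFormat or SplitSample. The class name SplitSampleSketch, the method sample, the representation of a split as a bare byte length, and the in-order selection are all hypothetical. IllegalArgumentException stands in for the SemanticException the thread mentions for an out-of-range percentage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep whole splits until their combined size reaches
// at least the requested percentage of the total input size.
class SplitSampleSketch {

    // splitLengths: size in bytes of each input split (assumed representation).
    // percent: requested sample percentage, assumed valid only in (0, 100].
    static List<Long> sample(List<Long> splitLengths, double percent) {
        if (percent <= 0 || percent > 100) {
            // Stand-in for a semantic error on an out-of-range percentage.
            throw new IllegalArgumentException("percentage must be in (0, 100]: " + percent);
        }
        long total = 0;
        for (long len : splitLengths) {
            total += len;
        }
        double target = total * percent / 100.0;

        List<Long> picked = new ArrayList<>();
        long taken = 0;
        for (long len : splitLengths) {
            if (taken >= target) {
                break; // guarantee met; splits are never cut, so we may overshoot
            }
            picked.add(len);
            taken += len;
        }
        return picked;
    }

    public static void main(String[] args) {
        // Four equal splits, 30% requested.
        List<Long> picked = sample(List.of(64L, 64L, 64L, 64L), 30.0);
        System.out.println(picked.size() + " of 4 splits kept");
    }
}
```

With four equal splits and 30 requested, one split (25%) is not enough, so two splits (50%) are kept. That is the behaviour described in the review comment: the sample covers at least the requested fraction of the input bytes, with no promise of an exact fraction or of a row count, which is why combining it with LIMIT is still useful.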