[ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025468#comment-13025468 ]
jirapos...@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq. >
bq. > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of the source data: not an exact amount, and no guarantee on the number of rows. I think an option to disable it is one way to avoid confusion. What do you think?
bq.
bq. Ning Zhang wrote:
bq. I think if we clearly specify the semantics of block-level sampling in the wiki/documentation, there shouldn't be much confusion. In fact, I think it is much easier to explain than bucket-based sampling. In addition, if users are confused about the semantics, throwing a SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper for this feature, so that we can disable it quickly if we find a bug in it. That can also be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will quickly grow unnecessarily large.

OK. I'll remove this switch. We need to document this and communicate it very well to users. People will easily misunderstand this.

bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq. >
bq. > limit can be combined with block sampling. It is just that this optimization for limit doesn't make sense when users already sample the input data, and we won't get much benefit from it.
bq.
bq. Ning Zhang wrote:
bq. I think combining these two still makes sense: 1) as you mentioned, block sampling does not limit the # of rows, but limit does.
bq. Combining these two allows users to get approximately N rows quickly. 2) this restriction creates an exception in terms of query language composition. According to the language syntax, it is allowed and makes sense to combine block sampling and limit, but the user will get a SemanticException if they do. I think a SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it as a TODO and file a JIRA for it.

I think there is a misunderstanding here. Limit works well with split sampling. No exception is thrown in either of those two combinations, and the result is what we expect. This condition only disables the optimization that runs some limit queries against a smaller data set. With split sampling, the user has already specified what percentage to sample, so there is no need to further run the query against a small subset of the inputs. The test case split_sample.q already shows that they work well together.

- Siying

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/633/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-04-26 21:19:18)
bq.
bq.
bq. Review request for hive, Ning Zhang and namit jain.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. We need better input sampling to serve at least two purposes:
bq. 1. test their queries against a smaller data set
bq. 2. understand more about what the data looks like without scanning the whole table.
bq. A simple function that gives a subset of splits will help in those cases. It doesn't have to be strict sampling.
bq.
bq. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with a total size of at least n% of the original inputs.
bq.
bq.
bq. This addresses bug HIVE-2121.
bq. https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
bq. trunk/conf/hive-default.xml 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
bq.
bq. Diff: https://reviews.apache.org/r/633/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.
bq.
bq. Thanks,
bq.
bq. Siying
bq.
bq.

> Input Sampling By Splits
> ------------------------
>
> Key: HIVE-2121
> URL: https://issues.apache.org/jira/browse/HIVE-2121
> Project: Hive
> Issue Type: New Feature
> Reporter: Siying Dong
> Assignee: Siying Dong
> Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about what the data looks like without scanning the whole table.
> A simple function that gives a subset of splits will help in those cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
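
As a rough illustration of the semantics discussed in the review, TABLESAMPLE(n PERCENT) keeps input splits until their combined size reaches at least n% of the input. The Python sketch below models only that rule and is not taken from the patch. The function name sample_splits, the in-order selection of splits, and the accepted range (0, 100] are assumptions made for the example.

```python
from typing import List, Sequence


def sample_splits(split_sizes: Sequence[int], percent: float) -> List[int]:
    """Return indices of splits whose combined size is at least `percent` of the total.

    Hypothetical helper for illustration only; this is not Hive's SplitSample logic.
    """
    # Assumed valid range; the patch's out-of-range test may use different bounds.
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    total = sum(split_sizes)
    target = total * percent / 100.0

    chosen: List[int] = []
    covered = 0
    for index, size in enumerate(split_sizes):
        # Stop once the selected splits already cover the requested share.
        if covered >= target:
            break
        chosen.append(index)
        covered += size
    return chosen
```

For four splits of 100 bytes each and n = 30, the sketch keeps the first two splits, which is 50% of the input: at least the requested percentage, but neither an exact fraction nor a fixed number of rows.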