Re: Review Request: Input Sampling Splits

Ning Zhang Tue, 26 Apr 2011 15:15:59 -0700


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, 
> > line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. 
> > It basically only guarantees that we fetch at least the sampled percentage 
> > of source data. Not exact number, nor guarantee for #rows. I think an 
> > option to disable it is a way to avoid confusion in some ways. How do you 
> > think?
> 
> Ning Zhang wrote:
>     I think if we specify clearly the semantics of block-level sample in the 
> wiki/documentation, there shouldn't be much confusion. In fact I think it is 
> much easier to explain than the bucket-based sampling. In addition if the 
> user has confusions about the semantics, throwing an SemanticException won't 
> help them understand.  I think the only use case for this parameter is to act 
> as a gatekeeper to this feature if we found a bug in it and want to disable 
> the feature quickly. That should be able to be achieved by switching branches 
> quickly. If we have a gatekeeper parameter for each feature, the conf will 
> grow unnecessarily large quickly.
> 
> Siying Dong wrote:
>     OK. I'll remove this switch. We need to document and communicate very 
> well to users. People will easily misunderstand this.


Sounds good. Yes, we should clearly document it in the wiki by what it does and 
does not. 


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, 
> > line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for 
> > limit doesn't make sense when users already sample the input data and we 
> > won't get much benefit.
> 
> Ning Zhang wrote:
>     I think combining these two still makes sense: 1) as you mentioned block 
> sampling is not limiting on the # of rows, but limit is. Combining these two 
> allows the users to get approximately N rows quickly. 2) this restriction 
> makes an exception in terms of the query language composition. From the 
> language syntax, it is allowed and makes senses to combine block-sampling and 
> limit, but the user will get a SemanticException if they do. I think 
> SemanticException should be thrown only when there is a legitimate semantic 
> error (e.g., the percentage number is negative). If you feel that it is not a 
> major use case and would rather do it in a follow-up JIRA, we should document 
> it in TODO and file a JIRA for it.
> 
> Siying Dong wrote:
>     I think it is the misunderstanding here. Limit works with split sampling 
> well. No exception will be thrown in any of those two combination and the 
> result will be what we expected.
>     This condition only disabled the optimization that runs against a smaller 
> data set for some limit queries. With split sampling, user already specific 
> what a percentage to sample, there is no need to further run the query 
> against a small subset of the inputs. From test case split_sample.q, you can 
> already see how they work together well.

oh I see. Thanks for the clarification!


- Ning


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole 
> table.
> A simple function that gives a subset splits will help in those cases. It 
> doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples 
> input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java 
> PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Reply via email to