[jira] [Commented] (HIVE-2121) Input Sampling By Splits

[email protected] (JIRA) Tue, 26 Apr 2011 15:11:45 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025468#comment-13025468
 ]

[email protected] commented on HIVE-2121:
-----------------------------------------------------

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq.  >
bq.  >     I feel it is a little hard to explain what this sample guarantees.
bq.  >     It basically only guarantees that we fetch at least the sampled
bq.  >     percentage of the source data. It gives neither an exact amount nor
bq.  >     any guarantee on the number of rows. I think an option to disable it
bq.  >     is one way to avoid confusion. What do you think?
bq.  
bq.  Ning Zhang wrote:
bq.      I think if we clearly specify the semantics of block-level sampling in
bq.      the wiki/documentation, there shouldn't be much confusion. In fact, I
bq.      think it is much easier to explain than bucket-based sampling. In
bq.      addition, if users are confused about the semantics, throwing a
bq.      SemanticException won't help them understand. I think the only use case
bq.      for this parameter is to act as a gatekeeper for the feature, so that we
bq.      can disable it quickly if we find a bug in it. That can already be
bq.      achieved by switching branches quickly. If we add a gatekeeper parameter
bq.      for each feature, the conf will quickly grow unnecessarily large.

OK, I'll remove this switch. We need to document the semantics clearly and 
communicate them well to users, because they are easy to misunderstand.
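
To make the "at least the sampled percentage" guarantee concrete, here is a 
minimal Java sketch. It assumes splits are taken in order until their 
cumulative size reaches n% of the total input size. The class and method names 
are hypothetical and are not taken from the SplitSample class in the patch; 
the actual selection logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the SplitSample class from the patch.
class SplitSampleSketch {

    // Returns the leading splits whose cumulative size first reaches at least
    // `percent` of the total input size. Sizes are in arbitrary units.
    static List<Long> sampleSplits(List<Long> splitSizes, double percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]: " + percent);
        }
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        double target = total * percent / 100.0;

        List<Long> chosen = new ArrayList<>();
        long picked = 0;
        for (long size : splitSizes) {
            if (picked >= target) {
                break; // the requested share of the input is already covered
            }
            chosen.add(size); // whole splits only, so we may overshoot the target
            picked += size;
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Long> splits = Arrays.asList(64L, 64L, 64L, 64L);
        System.out.println(sampleSplits(splits, 30.0));
    }
}
```

With four splits of 64 units each and a request for 30 percent, the sketch 
returns two splits, which is 50 percent of the input. That is the kind of 
overshoot the documentation needs to spell out: the percentage is a lower 
bound on input size, not an exact amount and not a row count.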

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq.  >
bq.  >     limit can be combined with block sampling. It is only this
bq.  >     optimization for limit that doesn't make sense when users already
bq.  >     sample the input data, since we won't get much benefit from it.
bq.  
bq.  Ning Zhang wrote:
bq.      I think combining these two still makes sense: 1) as you mentioned,
bq.      block sampling does not limit the # of rows, but limit does. Combining
bq.      the two allows users to get approximately N rows quickly. 2) this
bq.      restriction creates an exception to the composability of the query
bq.      language. The language syntax allows combining block sampling and limit,
bq.      and it makes sense to do so, but the user will get a SemanticException
bq.      if they do. I think a SemanticException should be thrown only when there
bq.      is a legitimate semantic error (e.g., the percentage is negative). If
bq.      you feel that it is not a major use case and would rather do it in a
bq.      follow-up JIRA, we should document it in a TODO and file a JIRA for it.

I think there is a misunderstanding here. Limit works well with split sampling. 
No exception is thrown when the two are combined, and the result is what we 
expect.
This condition only disables the optimization that runs some limit queries 
against a smaller data set. With split sampling, the user has already specified 
what percentage to sample, so there is no need to further restrict the query to 
a small subset of the inputs. The test case split_sample.q already shows that 
they work well together.
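
The condition described above can be sketched roughly as follows. This is a 
hypothetical illustration and not code from SemanticAnalyzer: the class name, 
the method name, and the two boolean flags are assumptions made for the 
example.

```java
// Hypothetical illustration only; names do not come from SemanticAnalyzer.
class LimitOptimizationSketch {

    // Decides whether a limit query should be run against a reduced subset of
    // its inputs. Split sampling already shrinks the input, so the extra
    // pruning is skipped in that case. The query itself stays valid either way.
    static boolean useLimitInputPruning(boolean hasLimit, boolean hasSplitSample) {
        return hasLimit && !hasSplitSample;
    }

    public static void main(String[] args) {
        // limit combined with TABLESAMPLE(n PERCENT): allowed, but not pruned further
        System.out.println(useLimitInputPruning(true, true));
        // plain limit query: eligible for the optimization
        System.out.println(useLimitInputPruning(true, false));
    }
}
```

The point of the sketch is that the check gates an optimization, not the 
query: both combinations run, and only the input-pruning step is skipped when 
a split sample is present.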

- Siying

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need better input sampling to serve at least two purposes:
bq.  1. let users test their queries against a smaller data set
bq.  2. let users understand more about what the data looks like without
bq.     scanning the whole table.
bq.  A simple function that returns a subset of the splits will help in those
bq.  cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which
bq.  samples input splits with a total size of at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. let users understand more about what the data looks like without scanning
>    the whole table.
> A simple function that returns a subset of the splits will help in those
> cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
