[jira] [Commented] (HIVE-2121) Input Sampling By Splits

[email protected] (JIRA) Tue, 26 Apr 2011 15:01:44 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025465#comment-13025465
 ]

[email protected] commented on HIVE-2121:
-----------------------------------------------------

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq.  >
bq.  >     I feel it is a little hard to explain what this sample guarantees.
bq.  >     It basically guarantees only that we fetch at least the sampled
bq.  >     percentage of the source data: not an exact amount, and no guarantee
bq.  >     on the number of rows. I think an option to disable it is a way to
bq.  >     avoid confusion. What do you think?

I think if we clearly specify the semantics of block-level sampling in the 
wiki/documentation, there shouldn't be much confusion. In fact, I think it is 
much easier to explain than bucket-based sampling. In addition, if users are 
confused about the semantics, throwing a SemanticException won't help them 
understand. I think the only use case for this parameter is to act as a 
gatekeeper, so that we can disable the feature quickly if we find a bug in it. 
That can be achieved just as well by switching branches. If we add a gatekeeper 
parameter for each feature, the conf will quickly grow unnecessarily large. 
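
To make the proposed semantics concrete for the documentation, here is a 
minimal sketch in Python (not the code in the patch) of what "at least n 
percent of the source data" could mean at the split level. The function name 
sample_splits, the list-of-sizes representation, and the (0, 100] range check 
are all assumptions made for illustration; the actual selection logic in the 
patch may differ.

```python
def sample_splits(split_sizes, percent):
    """Pick leading splits until their total size reaches percent% of the input.

    split_sizes: sizes of the input splits in bytes (hypothetical representation).
    Returns the indices of the chosen splits.
    """
    # Assumption: only percentages in (0, 100] are meaningful.
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    total = sum(split_sizes)
    target = total * percent / 100.0

    chosen = []
    covered = 0
    for index, size in enumerate(split_sizes):
        # Stop as soon as the splits taken so far cover the target size.
        if covered >= target:
            break
        chosen.append(index)
        covered += size
    return chosen
```

With four 100-byte splits and percent=30, this sketch picks the first two 
splits (200 bytes, or 50% of the input). Whole splits are taken, so the sample 
covers at least the requested percentage, not exactly that percentage, and 
nothing is promised about the number of rows.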

bq.  On 2011-04-26 20:50:30, Siying Dong wrote:
bq.  > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392
bq.  > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq.  >
bq.  >     limit can be combined with block sampling. It is just that this
bq.  >     optimization for limit doesn't make sense when users already sample
bq.  >     the input data, and we won't get much benefit.

I think combining these two still makes sense: 1) as you mentioned, block 
sampling does not limit the number of rows, but limit does. Combining the two 
allows users to get approximately N rows quickly. 2) this restriction creates 
an exception in the composition of the query language. The language syntax 
allows combining block sampling and limit, and doing so makes sense, but the 
user will get a SemanticException if they try. I think a SemanticException 
should be thrown only when there is a legitimate semantic error (e.g., the 
percentage is negative). If you feel that it is not a major use case and would 
rather do it in a follow-up JIRA, we should document it as a TODO and file a 
JIRA for it. 
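
As a rough illustration of point 1, the following Python sketch (again, not 
the code in the patch) composes split-level sampling with a row limit. The 
name sampled_rows and the representation of splits as lists of rows are 
hypothetical, and the range check on the percentage is an assumption. An 
error is raised only for an invalid percentage, never for the combination 
itself.

```python
from itertools import islice


def sampled_rows(splits, percent, limit=None):
    """Read rows from a sample of the splits, optionally capped by a row limit.

    splits: list of lists of rows (hypothetical stand-in for input splits).
    percent: minimum share of the input, measured in rows here, to sample.
    limit: maximum number of rows to return, or None for no cap.
    """
    # Only a genuinely invalid argument is an error (assumed valid range).
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    target = sum(len(split) for split in splits) * percent / 100.0

    def rows():
        covered = 0
        for split in splits:
            # Stop once the splits read so far cover the sampled share.
            if covered >= target:
                return
            covered += len(split)
            yield from split

    # islice stops pulling rows as soon as the limit is reached.
    return list(islice(rows(), limit))
```

For example, with four splits of three rows each, sampled_rows(splits, 30) 
returns the six rows of the first two splits, while 
sampled_rows(splits, 30, limit=4) returns only the first four of those rows. 
The sample bounds how much input is read, and the limit bounds the number of 
rows returned.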

- Ning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------

On 2011-04-26 21:19:18, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-26 21:19:18)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need better input sampling to serve at least two purposes:
bq.  1. let users test their queries against a smaller data set
bq.  2. understand more about what the data looks like without scanning the
bq.     whole table.
bq.  A simple function that gives a subset of splits will help in those cases.
bq.  It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which
bq.  samples input splits whose total size is at least n% of the original
bq.  input.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
bq.    trunk/conf/hive-default.xml 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand more about what the data looks like without scanning the whole 
> table.
> A simple function that gives a subset of splits will help in those cases. It 
> doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
