[jira] [Commented] (HIVE-2121) Input Sampling By Splits

[email protected] (JIRA) Tue, 26 Apr 2011 00:26:46 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025107#comment-13025107
 ]

[email protected] commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review558
-----------------------------------------------------------

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/633/#comment1178>

    The name of this parameter is a little confusing: the key is called 
"randomnumber", but its value is a fixed number.  Do you mean that this number 
is actually the seed used to generate samples?
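
    For illustration, this is what "seed" would mean in practice: the same 
configured value produces the same pseudo-random choices on every run. A 
minimal sketch, assuming the value is handed to java.util.Random; the class 
and method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not code from the patch under review.
final class SeedDemo {
    // Draws `count` pseudo-random ints in [0, bound) from a generator
    // that is seeded with the fixed, configured value `seed`.
    static List<Integer> draw(long seed, int count, int bound) {
        Random rng = new Random(seed);
        List<Integer> out = new ArrayList<Integer>();
        for (int i = 0; i < count; i++) {
            out.add(rng.nextInt(bound));
        }
        return out;
    }
}
```

    Two calls with the same seed return equal lists, which is why a name such 
as "seed" would describe a fixed value more accurately than "randomnumber".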

trunk/conf/hive-default.xml
<https://reviews.apache.org/r/633/#comment1179>

    Same as above.

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1195>

    Please add a comment for this function explaining the rationale behind 
sampling splits.

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1192>

    We should declare retLists as the interface (List) rather than the 
implementation (ArrayList).

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1193>

    Same here: this should be declared as Map rather than HashMap.
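
    The two declaration comments above can be sketched as follows. This is a 
hypothetical stand-alone example, not the code in CombineHiveInputFormat; only 
the variable name retLists comes from the review, and the grouping logic is 
invented to give the declarations something to do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not code from the patch under review.
final class InterfaceDeclDemo {
    // Groups names by their first character. Variables and the return type
    // are declared as Map/List; the concrete classes appear only after `new`.
    static Map<String, List<String>> groupByFirstChar(List<String> names) {
        Map<String, List<String>> groups = new HashMap<String, List<String>>();
        for (String name : names) {
            String key = name.isEmpty() ? "" : name.substring(0, 1);
            List<String> retLists = groups.get(key);
            if (retLists == null) {
                retLists = new ArrayList<String>();
                groups.put(key, retLists);
            }
            retLists.add(name);
        }
        return groups;
    }
}
```

    Callers depend only on the interfaces, so the implementation class can be 
swapped later without touching them.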

trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1194>

    Can you add comments here?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1182>

    Can you add a comment on what this Map is used for and what its keys and 
values are?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1184>

    I don't think we need to introduce this parameter at all. First, this is a 
new feature rather than a different code path for an old feature, so we don't 
need the "fallback" protection of a new parameter. Second, throwing a 
SemanticException here only makes the user ask how to solve the problem, and 
the answer is to set the parameter to true. So it doesn't make sense to set 
the parameter to false in any case. Why not remove this parameter to make the 
code cleaner?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1189>

    You may want to validate the percentage (that it is a valid double within 
the range [0,100]) and throw a SemanticException if it is invalid, before 
creating a SplitSample object.
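
    A minimal sketch of such a check, under stated assumptions: the nested 
SemanticException below is a stand-in so the example compiles on its own (the 
real one lives in Hive), and the class name, method names, and error messages 
are hypothetical.

```java
// Hypothetical illustration; not code from the patch under review.
final class PercentCheck {
    // Stand-in for Hive's SemanticException so this sketch is self-contained.
    static class SemanticException extends Exception {
        SemanticException(String msg) {
            super(msg);
        }
    }

    // Parses the percentage text and rejects anything that is not a
    // double within [0, 100], before any sampling object is built.
    static double parsePercent(String text) throws SemanticException {
        if (text == null) {
            throw new SemanticException("Sampling percentage is missing");
        }
        double percent;
        try {
            percent = Double.parseDouble(text.trim());
        } catch (NumberFormatException e) {
            throw new SemanticException("Not a number: " + text);
        }
        // Double.parseDouble accepts "NaN", so reject it explicitly.
        if (Double.isNaN(percent) || percent < 0.0 || percent > 100.0) {
            throw new SemanticException("Percentage must be in [0,100]: " + text);
        }
        return percent;
    }

    // Convenience wrapper that reports validity without throwing.
    static boolean isValidPercent(String text) {
        try {
            parsePercent(text);
            return true;
        } catch (SemanticException e) {
            return false;
        }
    }
}
```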

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1190>

    Why can't limit be combined with block sampling?

trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
<https://reviews.apache.org/r/633/#comment1191>

    This comment doesn't belong to this class.

- Ning

On 2011-04-20 18:28:29, Siying Dong wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/633/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-04-20 18:28:29)
bq.  
bq.  
bq.  Review request for hive, Ning Zhang and namit jain.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  We need better input sampling to serve at least two purposes:
bq.  1. let users test their queries against a smaller data set
bq.  2. understand more about what the data looks like without scanning the whole table.
bq.  A simple function that gives a subset of the splits will help in those cases. It doesn't have to be strict sampling.
bq.  
bq.  This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.
bq.  
bq.  
bq.  This addresses bug HIVE-2121.
bq.      https://issues.apache.org/jira/browse/HIVE-2121
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 
bq.    trunk/conf/hive-default.xml 1095244 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
bq.    trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
bq.    trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq.    trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
bq.    trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
bq.    trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 
bq.    trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 
bq.  
bq.  Diff: https://reviews.apache.org/r/633/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Siying
bq.  
bq.
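
The quoted summary describes TABLESAMPLE(n PERCENT) as picking input splits 
whose total size is at least n% of the original input. One way to read that, 
sketched under stated assumptions: splits are represented only by their sizes 
and are taken in their given order until the threshold is reached. The class 
and method names are hypothetical, and the patch itself may choose splits 
differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the algorithm from the patch under review.
final class SplitSampleSketch {
    // Returns the leading splits (by size) whose combined size first
    // reaches at least `percent`% of the total size of all splits.
    static List<Long> sample(List<Long> splitSizes, double percent) {
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        double target = total * percent / 100.0;

        List<Long> picked = new ArrayList<Long>();
        long accumulated = 0;
        for (long size : splitSizes) {
            if (accumulated >= target) {
                break; // threshold already met; stop adding splits
            }
            picked.add(size);
            accumulated += size;
        }
        return picked;
    }
}
```

Because whole splits are taken, the sample can overshoot n%, which matches the 
summary's remark that this does not have to be strict sampling.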

> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand more about what the data looks like without scanning the whole 
> table.
> A simple function that gives a subset of the splits will help in those cases. It 
> doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
