Re: Review Request: speedup addInputPaths

Ning Zhang Wed, 15 Jun 2011 12:00:28 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/898/#review842
-----------------------------------------------------------




trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1826>

    This change will make the order of paths in pathProcessed 
non-deterministic. This means mapred.input.dir will have not have the same 
order as before. Not sure if it is safe or not, but if you change HashSet with 
LinkedHashSet, the order will be preserved.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1828>

    Here needs some comments why this case doesn't need to check empty paths. 
    
    In terms of efficiency, it seems to me that checking empty paths is not the 
most expensive part (# of RPCs is large but each listStatus() should be fast). 
Also we should be able to cache (needs to extend Utilities.isEmpty) the results 
of listStatus for each path, which are anyway needed in other operations 
(compute splits etc). If 



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1829>

    It seem that we are doing redundant work as 
FileInputFormat.setInputPaths(JobConf, CommaSeparatedString). I think it would 
be safer and cleaner to first get an array of paths and call: 
    
    FileInputFormat.setInputPaths(StringUtils.stringToPath(String[] paths))



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
<https://reviews.apache.org/r/898/#comment1830>

    indentation



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1833>

    why do we need it here?



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1834>

    ! mrwork.getPartDescToRework().isEmpty()



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1835>

    The logic here is too complex and I think it's better to be refactored. Is 
the following what you wanted? 
    
    if (all_partitions_are_rework()) {
      prepareNullCombineFilter(combine);
    } else {
      prepareNormalCombineFilter(combine);
    } 
    InputSplitShim[] iss = combine.getSplits()



trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
<https://reviews.apache.org/r/898/#comment1832>

    Does here just need a HashSet<PartitionDesc> rather than a 
HashMap<PartitionDesc, Boolean>?


- Ning


On 2011-06-14 21:09:13, Yongqiang He wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/898/
> -----------------------------------------------------------
> 
> (Updated 2011-06-14 21:09:13)
> 
> 
> Review request for hive.
> 
> 
> Summary
> -------
> 
> speedup addInputPaths
> 
> 
> This addresses bug HIVE-2218.
>     https://issues.apache.org/jira/browse/HIVE-2218
> 
> 
> Diffs
> -----
> 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> 1135335 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335 
> 
> Diff: https://reviews.apache.org/r/898/diff
> 
> 
> Testing
> -------
> 
> yes.
> 
> 
> Thanks,
> 
> Yongqiang
> 
>

Re: Review Request: speedup addInputPaths

Reply via email to