Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Sahil Takiar Wed, 18 Jan 2017 14:51:07 -0800


> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > Can this be considered a S3 performance only? or is it fine to have many 
> > RPC calls in case the path is on HDFS?


We could make this S3 / blobstore specific, but this code block will also 
create empty files for each empty partition it finds, which can take some 
amount of time on HDFS.

It's the same number of RPC calls to HDFS, they are just all done in parallel. 
The value of mapred.dfsclient.parallelism.max probably needs to be tuned 
depending on if your running on HDFS or S3.


> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 2992
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line2992>
> >
> >     1. what happens if numThread is a negative number?
> >     2. do we need to execute a threadpool if numThreads is just 1?

Will fix.


> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 3016
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line3016>
> >
> >     Can this happen? I followed the workflow when file is null, and found 
> > that createDummyFileForEmptyPartition() may fail when attempts to call 
> > path.toString(). Is that right? What can we do to avoid a NPE there?

Good point. I think we can log a warning, and then continue on to the next 
file. I don't see why a null file would be added to the `pathToAliases` but it 
is possible. Will fix.


- Sahil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162184
-----------------------------------------------------------


On Jan. 18, 2017, 3:17 a.m., Sahil Takiar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 3:17 a.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-15546
>     https://issues.apache.org/jira/browse/HIVE-15546
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a 
> partition is done in parallel
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 
> 
> Diff: https://reviews.apache.org/r/55661/diff/
> 
> 
> Testing
> -------
> 
> Unit tests added
> 
> 
> Thanks,
> 
> Sahil Takiar
> 
>

Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Reply via email to