[ 
https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983416#comment-16983416
 ] 

Steve Loughran commented on HIVE-22548:
---------------------------------------

Also, at L1644 it calls path.exists() before the listFiles. Has anyone noticed that 
it is marked as deprecated? There's a reason we warn people about it: this 
recurrent code path of exists + operation duplicates the expensive check for 
whether a file or directory exists.

*Just call listStatus and treat a FileNotFoundException as a sign that the path 
doesn't exist.*

It is exactly what exists() does after all.
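
A minimal sketch of that pattern, written against java.nio.file rather than Hadoop's FileSystem so that it runs without Hadoop on the classpath. The class and method names are hypothetical. With Hadoop the same shape would be a single listStatus(path) call wrapped in a catch for FileNotFoundException.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive or Hadoop.
final class ListWithoutProbe {
  /** Lists the children of dir; a missing dir is reported as an empty list. */
  static List<Path> listOrEmpty(Path dir) throws IOException {
    List<Path> children = new ArrayList<>();
    // No exists() probe first: the listing itself reports a missing path.
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path child : stream) {
        children.add(child);
      }
    } catch (NoSuchFileException e) {
      // Same answer exists() would have given, without the extra round trip.
      return new ArrayList<>();
    }
    return children;
  }
}
```

On a local disk the saving is small. Against an object store, each skipped probe is one or more remote requests.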

While I'm looking at that class:

h3. removeEmptyDpDirectory

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1601]

This contains a needless listFiles just to see if the directory is empty.

If you use delete(path, false) (i.e. the non-recursive one), it checks for 
children internally *and rejects the call* if there are any. Just swallow the 
exception it raises to tell you off about this.
 * We have a test for this for every single file system; it is the same as "rm 
dir" on the command line. You do not need to worry about it being implemented 
wrong.
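
As a rough illustration of "let delete do the emptiness check", here is the same idea using java.nio.file, whose Files.delete also refuses to remove a non-empty directory. This is an analogue chosen so the sketch is runnable without Hadoop, not the Hive code. The helper name is hypothetical, and the exception type a Hadoop file system raises from delete(path, false) will differ from the one caught here.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper, not part of Hive or Hadoop.
final class EmptyDirCleanup {
  /** Returns true only if dir was empty and has now been removed. */
  static boolean removeIfEmpty(Path dir) throws IOException {
    try {
      // No listing beforehand: the delete itself rejects a non-empty directory.
      Files.delete(dir);
      return true;
    } catch (DirectoryNotEmptyException | NoSuchFileException e) {
      // Directory has children, or is already gone: nothing to clean up.
      return false;
    }
  }
}
```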

h3. removeTempOrDuplicateFiles

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1757]

delete() returns false in only two conditions:
 # you've tried to delete root
 # the file wasn't actually there

You shouldn't need to check for existence first, and if there is any chance 
that some other process would delete the temp file, the check would convert a 
no-op into a failure.
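
A small sketch of deleting without the up-front check, again using java.nio.file as a stand-in so it runs without Hadoop. Files.deleteIfExists plays the role of a delete() whose false result just means "nothing was there". The helper name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Hive or Hadoop.
final class TempFileCleanup {
  /** Deletes a temp file; returns false if it was already gone. */
  static boolean deleteTemp(Path tempFile) throws IOException {
    // No exists() check: if another process removed the file first,
    // this is a harmless no-op instead of a failure.
    return Files.deleteIfExists(tempFile);
  }
}
```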

h3. getFileSizeRecursively()

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1840]
 getFileSizeRecursively() is potentially really expensive too.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1853]
 This swallows all exception details. Please include the message and the nested 
exception. Everyone who fields support calls will appreciate this.
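
For what "include the message and the nested exception" might look like, here is a generic Java sketch. The code at L1853 is not reproduced in this comment, so the class, method, and exception type below are hypothetical. The point is only that the rethrown exception carries the path, the original message, and the cause.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the Hive implementation.
final class SizeWithContext {
  static long sizeOf(Path file) {
    try {
      return Files.size(file);
    } catch (IOException e) {
      // Keep the path and the original text in the message, and pass e
      // as the cause so the full stack trace survives.
      throw new UncheckedIOException("Failed to get size of " + file + ": " + e, e);
    }
  }
}
```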

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final 
> location
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-22548
>                 URL: https://issues.apache.org/jira/browse/HIVE-22548
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 3.1.2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also 
> runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629



--
This message was sent by Atlassian Jira
(v8.3.4#803005)