[ https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983416#comment-16983416 ]
Steve Loughran commented on HIVE-22548:
---------------------------------------

Also, at L1644 it calls exists() on the path before the listFiles. Has anyone noticed that it is marked as deprecated? There's a reason we warn people about it, and it's this recurrent code path of exists + operation, which duplicates the expensive check for the file or directory existing. *Just call listStatus and treat a FileNotFoundException as a sign that the path doesn't exist.* That is exactly what exists() does internally, after all (a sketch is appended below).

While I'm looking at that class:

h3. removeEmptyDpDirectory

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1601]

This contains a needless listFiles call just to see whether the directory is empty. If you use delete(path, false) (i.e. the non-recursive one), it does the check for children internally *and rejects the call*. Just swallow any exception it raises telling you off about this fact (a sketch is appended below).
* We have a test for this for every single file system; it is the same as "rm dir" on the command line. You do not need to worry about it being implemented wrong.

h3. removeTempOrDuplicateFiles

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1757]

delete() returns false in only two conditions:
# you've tried to delete root
# the file wasn't actually there

You shouldn't need to check the return value, and if there is any chance that some other process deletes the temp file first, checking it would convert a harmless no-op into a failure.

h3. getFileSizeRecursively()

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1840]

getFileSizeRecursively() is potentially really expensive too.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1853]

This swallows all the exception details. Please include the message and the nested exception; everyone who fields support calls will appreciate it (a sketch is appended below).

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location
> ----------------------------------------------------------------------------------
>
> Key: HIVE-22548
> URL: https://issues.apache.org/jira/browse/HIVE-22548
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Affects Versions: 3.1.2
> Reporter: Rajesh Balamohan
> Priority: Major
>
> {{Utilities.removeTempOrDuplicateFiles}} is very slow with cloud storage, as it executes {{listStatus}} twice and also runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629
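To illustrate the exists() + listStatus point from the comment above, here is a minimal sketch against the stock Hadoop FileSystem API. The helper name listIfPresent is made up for the example and is not existing Hive code; the only assumption is the documented behaviour that listStatus() raises FileNotFoundException for an absent path.

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ListIfPresentSketch {
  /**
   * One round trip instead of exists() + listStatus(): listStatus() already
   * raises FileNotFoundException when the path is absent, which is the same
   * probe exists() would have performed separately.
   */
  static FileStatus[] listIfPresent(FileSystem fs, Path path) throws IOException {
    try {
      return fs.listStatus(path);
    } catch (FileNotFoundException e) {
      // Path does not exist: report "no children" rather than paying for a
      // second round trip just to discover the same fact.
      return new FileStatus[0];
    }
  }
}
{code}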
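For the removeEmptyDpDirectory point, a sketch of leaning on the non-recursive delete instead of listing first. The method name deleteIfEmpty is hypothetical; only the stock FileSystem.delete(Path, boolean) contract described in the comment is assumed.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class DeleteIfEmptySketch {
  /** Remove the directory only if it is empty, without listing it first. */
  static void deleteIfEmpty(FileSystem fs, Path dir) {
    try {
      // Non-recursive delete: succeeds on an empty directory, is rejected on a
      // non-empty one, and simply returns false if the path is already gone.
      fs.delete(dir, false);
    } catch (IOException e) {
      // The directory had children: leave it in place, exactly as the current
      // listing-based check would have done.
    }
  }
}
{code}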
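And for the exception swallowing at L1853, a sketch of the error handling being asked for: keep the path, the message, and the nested cause. The getContentSummary() call here is only a stand-in for whatever the real getFileSizeRecursively() body does; the point is the catch block.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FileSizeSketch {
  static long fileSize(FileSystem fs, Path path) {
    try {
      // Stand-in for the real size computation in getFileSizeRecursively().
      return fs.getContentSummary(path).getLength();
    } catch (IOException e) {
      // Keep the path, the original message, and the nested exception so that
      // whoever fields the support call can see what actually failed.
      throw new RuntimeException("Failed to get size of " + path + ": " + e, e);
    }
  }
}
{code}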