[
https://issues.apache.org/jira/browse/HIVE-28963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954245#comment-17954245
]
Kokila N commented on HIVE-28963:
---------------------------------
Thanks for the feedback — great points.
I've updated the title and description.
Regarding the note on {{listFiles}}: you're absolutely right. For
object stores like S3, {{FileSystem.listFiles(path, true)}} has special
handling under the hood; it's more memory-efficient, so the performance cost
isn't as severe as with HDFS.
Also agreed that {{PathFilter}} operates only on the client side, so using
it alone won't reduce the number of file metadata entries fetched. I am
considering getting all the partitions and then using the write id to check
the delta directories (rough sketch below). This is just a thought so far;
I need to investigate it.
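For reference, a rough sketch of that idea (purely illustrative: the class and
method names and the directory-name regex are my assumptions, and real code
would presumably reuse the helpers in {{AcidUtils}} rather than parse names by
hand):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeltaFilterSketch {

  // Matches names like delta_0000005_0000005, delta_0000005_0000005_0000,
  // and the delete_delta_* variants.
  private static final Pattern DELTA_NAME =
      Pattern.compile("(delete_)?delta_(\\d+)_(\\d+)(_\\d+)?");

  /**
   * Lists only the delta directories under one partition that cover the
   * given write id, using a single non-recursive listStatus call instead
   * of a recursive listFiles over everything.
   */
  public static List<FileStatus> listDeltasForWriteId(
      FileSystem fs, Path partitionDir, long writeId) throws IOException {
    List<FileStatus> matches = new ArrayList<>();
    for (FileStatus stat : fs.listStatus(partitionDir)) {
      if (!stat.isDirectory()) {
        continue;
      }
      Matcher m = DELTA_NAME.matcher(stat.getPath().getName());
      if (m.matches()) {
        long minWriteId = Long.parseLong(m.group(2));
        long maxWriteId = Long.parseLong(m.group(3));
        // Keep only deltas whose write id range covers the current insert.
        if (minWriteId <= writeId && writeId <= maxWriteId) {
          matches.add(stat);
        }
      }
    }
    return matches;
  }
}{code}
The idea would be to call this once per partition returned by the metastore,
so the number of namenode/object-store calls scales with the number of
partitions rather than with the total file count.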
> Optimize the listing files for direct insert
> --------------------------------------------
>
> Key: HIVE-28963
> URL: https://issues.apache.org/jira/browse/HIVE-28963
> Project: Hive
> Issue Type: Bug
> Reporter: Kokila N
> Assignee: Kokila N
> Priority: Major
>
> *Background:*
> An Insert Overwrite query with hive.acid.direct.insert.enabled=true writes
> data directly to the target directory (the actual table location) in the MOVE
> stage. Say the insert query has write id 5 and the cleaner has marked the
> write id 2 directories as obsolete.
> {code:java}
> private static Path[] getDirectInsertDirectoryCandidatesRecursive(
>     FileSystem fs, Path path, int skipLevels, PathFilter filter) throws IOException {
>   String lastRelDir = null;
>   HashSet<Path> results = new HashSet<Path>();
>   String relRoot = Path.getPathWithoutSchemeAndAuthority(path).toString();
>   if (!relRoot.endsWith(Path.SEPARATOR)) {
>     relRoot += Path.SEPARATOR;
>   }
>   RemoteIterator<LocatedFileStatus> allFiles = fs.listFiles(path, true);
>   while (allFiles.hasNext()) {
>     LocatedFileStatus lfs = allFiles.next();
>     // ... rest of the loop body elided ...
>   }
> }{code}
> _*fs.listFiles(path, true)*_
> - This recursively lists {*}all files{*}, even those that may become obsolete
> or be deleted during iteration. So if Hive's cleaner deletes {{base_000002}}
> _after_ it's discovered by {{listFiles()}} but _before_ {{hasNext()}} tries
> to access it (for metadata resolution), we get a {{FileNotFoundException}}.
> There is a fix for this in Hadoop upstream (HADOOP-18662); the issue was
> discovered when running a Hadoop version without that fix.
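> A hypothetical client-side workaround (illustrative only; the method name,
> {{MAX_LISTING_ATTEMPTS}}, and the retry shape are assumptions, not the actual
> Hadoop fix) would be to retry the whole listing when the cleaner wins the race:
> {code:java}
> // Illustrative only: retry the recursive listing a bounded number of times
> // if the cleaner deletes a directory between discovery and metadata fetch.
> private static final int MAX_LISTING_ATTEMPTS = 3;
>
> private static Path[] getCandidatesWithRetry(FileSystem fs, Path path,
>     int skipLevels, PathFilter filter) throws IOException {
>   FileNotFoundException lastFailure = null;
>   for (int attempt = 0; attempt < MAX_LISTING_ATTEMPTS; attempt++) {
>     try {
>       return getDirectInsertDirectoryCandidatesRecursive(fs, path, skipLevels, filter);
>     } catch (FileNotFoundException e) {
>       // A base/delta directory vanished mid-iteration; list again.
>       lastFailure = e;
>     }
>   }
>   throw lastFailure;
> }{code}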
> But why list all files recursively irrespective of the current write id,
> when everything not matching it is filtered out later anyway?
> We need to optimize on the Hive side to see if we can filter and list only
> the current write id's files from HDFS.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)