[ 
https://issues.apache.org/jira/browse/HIVE-28963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kokila N updated HIVE-28963:
----------------------------
    Description: 
*Background:*
Insert Overwrite query with hive.acid.direct.insert.enabled=true writes data 
directly to the target directory (the actual table location) in the MOVE stage. 
Say the insert query has write ID 5 and the cleaner has marked the write ID 2 
directories as obsolete.
{code:java}
private static Path[] getDirectInsertDirectoryCandidatesRecursive(FileSystem fs,
    Path path, int skipLevels, PathFilter filter) throws IOException {
  String lastRelDir = null;
  HashSet<Path> results = new HashSet<Path>();
  String relRoot = Path.getPathWithoutSchemeAndAuthority(path).toString();
  if (!relRoot.endsWith(Path.SEPARATOR)) {
    relRoot += Path.SEPARATOR;
  }
  RemoteIterator<LocatedFileStatus> allFiles = fs.listFiles(path, true);
  while (allFiles.hasNext()) {
    LocatedFileStatus lfs = allFiles.next();
    // ...
  }
}{code}
_*fs.listFiles(path, true)*_
    - This recursively lists {*}all files{*}, even those that may be obsolete 
or get deleted during iteration. So if Hive's cleaner deletes {{base_000002}} 
_after_ it is discovered by {{listFiles()}} but _before_ {{hasNext()}} tries to 
access it (for metadata resolution), we get a {{FileNotFoundException}}.
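
A minimal sketch of the vulnerable listing pattern (hypothetical table path; 
not the actual Hive code path, just the Hadoop API calls involved):
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListFilesRaceSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical table location; in the real scenario this is the ACID table directory.
    Path tablePath = new Path("/warehouse/tablespace/managed/hive/t");
    FileSystem fs = tablePath.getFileSystem(new Configuration());

    // listFiles(path, true) walks the whole tree, including obsolete base_/delta_ dirs.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tablePath, true);
    try {
      while (it.hasNext()) {            // hasNext() fetches the next batch of metadata
        LocatedFileStatus lfs = it.next();
        System.out.println(lfs.getPath());
      }
    } catch (FileNotFoundException e) {
      // If the cleaner removed e.g. base_000002 after it was discovered by
      // listFiles() but before hasNext() resolved its metadata, we end up here.
      throw e;
    }
  }
}
{code}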

There is a fix for this issue in upstream Hadoop (HADOOP-18662).
This was discovered when running a Hadoop version without that fix.

But why list all the files recursively irrespective of the current write ID, 
when everything else is filtered out later anyway?
We should optimize on the Hive side and see whether we can filter and list only 
the current write ID's files from HDFS, as in the sketch below.
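
A rough sketch of that idea, assuming an unpartitioned table and simplified 
base_/delta_ directory naming (the filter and helper below are hypothetical, 
not existing Hive code):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;

public class CurrentWriteIdListingSketch {

  // Hypothetical filter: keep only base_/delta_ directories belonging to the
  // current write ID (directory-name formatting is simplified here).
  static PathFilter writeIdFilter(long writeId) {
    String id = String.format("%07d", writeId);
    return p -> p.getName().startsWith("base_" + id)
             || p.getName().startsWith("delta_" + id);
  }

  // Instead of fs.listFiles(tablePath, true) over the whole table, list the
  // top level once, keep only the current write ID's directories, and recurse
  // only inside those.
  static List<Path> listCurrentWriteIdFiles(FileSystem fs, Path tablePath, long writeId)
      throws IOException {
    List<Path> files = new ArrayList<>();
    for (FileStatus dir : fs.listStatus(tablePath, writeIdFilter(writeId))) {
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir.getPath(), true);
      while (it.hasNext()) {
        files.add(it.next().getPath());
      }
    }
    return files;
  }
}
{code}
This would also sidestep the race with the cleaner, because the obsolete write 
ID 2 directories are never visited in the first place.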

  was:
Insert Overwrite query with hive.acid.direct.insert.enabled=true writes data 
directly to the target directory (the actual table location) in the MOVE stage. 
Say the insert query has write ID 5 and the cleaner has marked the write ID 2 
directories as obsolete.
{code:java}
private static Path[] getDirectInsertDirectoryCandidatesRecursive(FileSystem fs,
    Path path, int skipLevels, PathFilter filter) throws IOException {
  String lastRelDir = null;
  HashSet<Path> results = new HashSet<Path>();
  String relRoot = Path.getPathWithoutSchemeAndAuthority(path).toString();
  if (!relRoot.endsWith(Path.SEPARATOR)) {
    relRoot += Path.SEPARATOR;
  }
  RemoteIterator<LocatedFileStatus> allFiles = fs.listFiles(path, true);
  while (allFiles.hasNext()) {
    LocatedFileStatus lfs = allFiles.next();
    // ...
  }
}{code}
_*fs.listFiles(path, true)*_
    - This recursively lists {*}all files{*}, even those that may be obsolete 
or get deleted during iteration. So if Hive's cleaner deletes {{base_000002}} 
_after_ it is discovered by {{listFiles()}} but _before_ {{hasNext()}} tries to 
access it (for metadata resolution), we get a {{FileNotFoundException}}.

*Cleaner:*
  There is no issue with the cleaner because it deletes only the 
files/directories that are marked as obsolete.

There is a fix for this issue in upstream Hadoop (HADOOP-18662).
This was discovered when running a Hadoop version without that fix.

Considering this, we can handle the issue and also optimize from the Hive side.

This scenario is a race condition and should occur rarely, since the cleaner 
has to delete those directories during the MOVE stage of the insert, between 
{{listFiles()}} and {{hasNext()}}, to trigger this issue.


> Optimize the listing files for direct insert
> --------------------------------------------
>
>                 Key: HIVE-28963
>                 URL: https://issues.apache.org/jira/browse/HIVE-28963
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Kokila N
>            Assignee: Kokila N
>            Priority: Major
>
> *Background:*
> Insert Overwrite query with hive.acid.direct.insert.enabled=true writes data 
> directly to the target directory (the actual table location) in the MOVE stage. 
> Say the insert query has write ID 5 and the cleaner has marked the write ID 2 
> directories as obsolete.
> {code:java}
> private static Path[] getDirectInsertDirectoryCandidatesRecursive(FileSystem fs,
>     Path path, int skipLevels, PathFilter filter) throws IOException {
>   String lastRelDir = null;
>   HashSet<Path> results = new HashSet<Path>();
>   String relRoot = Path.getPathWithoutSchemeAndAuthority(path).toString();
>   if (!relRoot.endsWith(Path.SEPARATOR)) {
>     relRoot += Path.SEPARATOR;
>   }
>   RemoteIterator<LocatedFileStatus> allFiles = fs.listFiles(path, true);
>   while (allFiles.hasNext()) {
>     LocatedFileStatus lfs = allFiles.next();
>     // ...
>   }
> }{code}
> _*fs.listFiles(path, true)*_
>     - This recursively lists {*}all files{*}, even those that may be obsolete 
> or get deleted during iteration. So if Hive's cleaner deletes {{base_000002}} 
> _after_ it is discovered by {{listFiles()}} but _before_ {{hasNext()}} tries 
> to access it (for metadata resolution), we get a {{FileNotFoundException}}.
> There is a fix for this issue in upstream Hadoop (HADOOP-18662).
> This was discovered when running a Hadoop version without that fix.
> But why list all the files recursively irrespective of the current write ID, 
> when everything else is filtered out later anyway?
> We should optimize on the Hive side and see whether we can filter and list only 
> the current write ID's files from HDFS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
