yashtc opened a new pull request, #54647:
URL: https://github.com/apache/spark/pull/54647

   
   ### What changes were proposed in this pull request?
   
   Schema inference via `mergeSchema` can fail when a file is deleted between 
the file listing step and the footer-reading step. This race is common in 
cloud storage environments, where files are frequently removed between 
listing and reading.
   
   The `spark.sql.files.ignoreMissingFiles` option already suppresses 
`FileNotFoundException` during data reads (`FileScanRDD`) but was silently 
ignored during schema inference.
   
   This PR propagates `ignoreMissingFiles` through the Parquet and ORC schema 
inference paths:
   
   - `SchemaMergeUtils.mergeSchemasInParallel`: extracts `ignoreMissingFiles` 
from `parameters` and passes it as a fourth argument to the `schemaReader` 
function (type updated accordingly).
   - `ParquetFileFormat.readParquetFootersInParallel`: catches exceptions with 
`FileNotFoundException` anywhere in the cause chain (using 
`ExceptionUtils.getThrowables`) and skips the file when 
`ignoreMissingFiles=true`. The cause-chain check is needed because Parquet 
wraps `IOException` in `RuntimeException`.
   - `OrcUtils.readSchema` / `readOrcSchemasInParallel`: catches 
`FileNotFoundException` directly before the existing `FileFormatException` 
handler.
   - `OrcFileOperator.getFileReader` / `readOrcSchemasInParallel`: same pattern 
for Hive ORC.
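
The cause-chain check described above can be sketched in plain Scala. This is a minimal illustration, not the code in this PR: the object `MissingFileTolerance`, the helpers `hasFileNotFoundCause` and `readAll`, and the `read` function are hypothetical names, and the chain is walked by hand rather than with `ExceptionUtils.getThrowables`.

```scala
import java.io.FileNotFoundException
import scala.annotation.tailrec

// Hypothetical helper object, for illustration only.
object MissingFileTolerance {

  /** True if `t` or any exception in its cause chain is a FileNotFoundException. */
  def hasFileNotFoundCause(t: Throwable): Boolean = {
    @tailrec
    def loop(cur: Throwable, depth: Int): Boolean =
      if (cur == null || depth > 64) false // depth cap guards against cyclic cause chains
      else if (cur.isInstanceOf[FileNotFoundException]) true
      else loop(cur.getCause, depth + 1)
    loop(t, 0)
  }

  /**
   * Applies `read` (a stand-in for a footer or schema reader) to each path.
   * When `ignoreMissingFiles` is true, a path whose read fails with a
   * FileNotFoundException anywhere in the cause chain is skipped.
   * Every other failure is rethrown unchanged.
   */
  def readAll[A](paths: Seq[String], ignoreMissingFiles: Boolean)(read: String => A): Seq[A] =
    paths.flatMap { path =>
      try Some(read(path))
      catch {
        case e: Exception if ignoreMissingFiles && hasFileNotFoundCause(e) =>
          None // the file vanished between listing and reading
      }
    }
}
```

Checking the whole chain rather than only the top-level exception is what lets a `RuntimeException` wrapping a `FileNotFoundException` be treated as a missing file, while unrelated failures still propagate.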
   
   ### Why are the changes needed?
   
   Without this fix, any user who sets `mergeSchema=true` on a path with 
concurrent deletes gets an unrecoverable exception, even after opting 
into tolerating missing files via `spark.sql.files.ignoreMissingFiles`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes: when `spark.sql.files.ignoreMissingFiles=true`, files that disappear 
between listing and schema reading are now silently skipped (consistent with 
the existing behaviour during data reads) instead of causing an error.
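
A user-side configuration that exercises this behaviour might look like the following sketch. It assumes a Spark build that includes this change; the application name, the local master, and the path `/path/to/table` are placeholders, not values from the PR.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session settings, for illustration only.
val spark = SparkSession.builder()
  .appName("merge-schema-example")
  .master("local[*]")
  .getOrCreate()

// Opt in to tolerating files that disappear after listing.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Schema inference merges the footers of every listed file; with the option
// above, a file deleted before its footer is read is skipped.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/table") // hypothetical path

df.printSchema()
```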
   
   ### How was this patch tested?
   
   - Unit tests in `ParquetFileFormatSuite`: direct calls to 
`readParquetFootersInParallel` with a deleted file (local FS), with a 
`RuntimeException`-wrapped `FileNotFoundException` (via 
`WrappingFNFLocalFileSystem`), and end-to-end through `mergeSchemasInParallel`.
   - Unit tests in `OrcSourceSuite`: direct calls to 
`OrcUtils.readOrcSchemasInParallel` with a deleted file.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Claude Code v2.1.69


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

