mormigil opened a new pull request, #19574:
URL: https://github.com/apache/druid/pull/19574
### Description
All SQL/MSQ ingestion of the form `INSERT/REPLACE … SELECT … FROM
TABLE(EXTERN(...))` that reads a **random-access input format** (Parquet, ORC,
Avro-OCF, SQL, Druid-segment) from remote storage (e.g. S3) fails on the
worker/peon with:
```
Caused by: java.io.IOException: No such file or directory
at java.base/java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.base/java.io.File.createTempFile(File.java:2170)
at org.apache.druid.data.input.InputEntity.fetch(InputEntity.java)
at org.apache.druid.data.input.parquet.ParquetReader.intermediateRowIterator(ParquetReader.java:86)
at ... ExternalSegment ... ScanQueryFrameProcessor ...
```
Streaming formats (JSON/CSV) and `index_kafka` ingestion are unaffected.
This is a regression: it works in 32.x and breaks in 37.0.0.
### Root cause
Random-access formats download each remote object to a local temp file via
`InputEntity#fetch(temporaryDirectory, …)`, which calls
`File.createTempFile(prefix, suffix, temporaryDirectory)`. `createTempFile`
does **not** create parent directories. In the MSQ indexer worker the directory
is derived lazily and never created:
| Path | Created? | Where |
|------|----------|-------|
| `<taskWorkDir>/indexing-tmp` | ✅ | `TaskToolbox#getIndexingTmpDir` (`mkdirp`) |
| `…/indexing-tmp/stage_NNNNNN` | ❌ | `IndexerFrameContext#tempDir` |
| `…/stage_NNNNNN/external` | ❌ | `RunWorkOrder` → `frameContext.tempDir("external")` → `ExternalInputSliceReader` |
Output channels work because `FileOutputChannelFactory#openChannel` already
calls `FileUtils.mkdirp(...)` before writing; the input fetch path simply
lacked the symmetric call. Streaming formats read via `InputEntity#open()` and
never create a temp file, which is why only fetch-based formats regressed. This
nesting was introduced by the background-fetch / virtual-storage external-input
rewrite (#19539).
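
The failure mode can be reproduced outside Druid. The following is a minimal standalone sketch, not Druid code: the class name, the helper `failsWithoutParent`, and the `stage_000000/external` path are hypothetical and only mimic the nesting described above. It shows that `File.createTempFile` throws an `IOException` when the target directory does not exist.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class CreateTempFileDemo {
    // Hypothetical helper: returns true if createTempFile fails because the
    // target directory (nested under an existing base directory) is missing.
    static boolean failsWithoutParent(File base) {
        // Illustrative nesting only; mirrors <indexing-tmp>/stage_NNNNNN/external.
        File missing = new File(base, "stage_000000/external");
        try {
            File.createTempFile("fetch-", ".tmp", missing);
            return false;
        } catch (IOException e) {
            // createTempFile does not create the directory it is asked to write into.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // The base directory exists, like <taskWorkDir>/indexing-tmp in the table above.
        File base = Files.createTempDirectory("indexing-tmp").toFile();
        System.out.println("fails without parent: " + failsWithoutParent(base));
    }
}
```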
### Fix
`mkdirp` the directory in `InputEntity#fetch` right before `createTempFile`,
mirroring `FileOutputChannelFactory#openChannel`. The call is idempotent and
covers every fetch-based input format.
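
The shape of the fix can be sketched as follows. This is an illustrative standalone example, not the actual patch: `createTempFileIn` is a hypothetical helper, and `Files.createDirectories` from the JDK stands in for Druid's `FileUtils.mkdirp`, on the assumption that both create missing parents and do nothing when the directory already exists.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class FetchTempFileSketch {
    // Hypothetical helper, not Druid's InputEntity#fetch: ensure the directory
    // exists, then create the temp file inside it.
    static File createTempFileIn(File temporaryDirectory, String prefix, String suffix)
            throws IOException {
        // Stand-in for FileUtils.mkdirp (assumption). Creates all missing parent
        // directories and is a no-op if the directory is already present.
        Files.createDirectories(temporaryDirectory.toPath());
        return File.createTempFile(prefix, suffix, temporaryDirectory);
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("indexing-tmp").toFile();
        // Illustrative nesting only; neither level exists yet.
        File nested = new File(base, "stage_000000/external");

        File first = createTempFileIn(nested, "fetch-", ".tmp");
        // A second call succeeds as well, since the directory creation is idempotent.
        File second = createTempFileIn(nested, "fetch-", ".tmp");

        System.out.println(first.exists() && second.exists());
    }
}
```

Placing the directory creation next to `createTempFile`, rather than at each call site that derives a temp directory, means any caller that passes a not-yet-created directory is covered.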
### Verification
Added Parquet coverage to `S3ExternQueryTest` (real embedded cluster +
MinIO), exercising the actual indexer fetch path for both
`backgroundFetchExternalFiles` on and off.
- With the fix: all 4 cases pass.
- With the one-line fix reverted: `test_externParquet_backgroundFetchDisabled`
fails with `java.io.IOException: No such file or directory`. That test covers
the direct-read path; the background-fetch path stages via a local `FileEntity`
that skips `createTempFile`, so it does not hit the error.
<hr>
This PR has:
- [x] been self-reviewed.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster.
Made with [Cursor](https://cursor.com)