vrozov commented on PR #49654:
URL: https://github.com/apache/spark/pull/49654#issuecomment-2632483929

   > I'd like to make this be super clear what scenario(s) make us struggle 
without this fix and how this fix will help resolving it.
   
   Please see [SPARK-50854](https://issues.apache.org/jira/browse/SPARK-50854). 
When a relative path is used in structured streaming (`DataStreamWriter`), the 
Parquet files are written to a location relative to the executor's working 
directory (for example 
`/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/work/app-20250203151257-0000/5/test.parquet/part-00000-54a33cfa-b34c-4e84-8589-cfe763b18ccd-c000.snappy.parquet`)
 instead of a location relative to the driver, as is the case in batch 
processing.
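   For illustration only (this is plain Python, not Spark's actual resolution code, and the directory names are hypothetical): a relative path resolves against the current working directory of whichever process opens it, so a driver and an executor started from different directories end up writing the same relative sink path to different absolute locations.

```python
import os
import tempfile

# Hypothetical working directories for a driver and an executor process.
driver_cwd = tempfile.mkdtemp(prefix="driver-")
executor_cwd = tempfile.mkdtemp(prefix="executor-")

# The same relative sink path the user passes to DataStreamWriter.
relative_sink = "test.parquet"

def resolve(path, cwd):
    """Resolve a path the way a process whose working directory is `cwd` would:
    absolute paths pass through, relative paths attach to the process CWD."""
    return path if os.path.isabs(path) else os.path.join(cwd, path)

driver_view = resolve(relative_sink, driver_cwd)
executor_view = resolve(relative_sink, executor_cwd)

# Same user-supplied path, two different physical locations.
assert driver_view != executor_view
print(driver_view)
print(executor_view)
```

   The fix discussed in this PR amounts to resolving the path once on the driver side, so executors receive an already-absolute location.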
   
   This PR *does not* address the validity or the use case of relative paths 
as such. That usage is likely limited to single-node clusters where the driver, 
master, and executors all have access to the same local file system, though 
there may be other setups where the same condition holds.
   
   >I don't think it is very common scenario that people installs Spark in 
multiple directories and runs driver and executor in separate directory (or any 
way to set different working directory). Using relative path which resolves to 
different directories per process doesn't seem like a common scenario and I'd 
like to see the detail.
   
   The PR does not target installations into different directories. The problem 
is reproducible on a single-node cluster with a default installation (for 
example, a dev build or `brew install`). The driver is likely to be executed 
from the root or from its own directory, since the driver code is separate 
from the Spark code.
   
   > You can explain in other way around - if you see that file source/sink 
resolves the path in driver in batch query, please describe what's your setup 
and how you tested and what's the result backing up your claim. It could be 
used to make a valid claim that we want to have consistence between batch and 
streaming.
   
   There is no custom Spark setup involved; it is a default single-node 
cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

