vrozov commented on PR #49654: URL: https://github.com/apache/spark/pull/49654#issuecomment-2632483929
> I'd like to make this be super clear what scenario(s) make us struggle without this fix and how this fix will help resolving it.

Please see [SPARK-50854](https://issues.apache.org/jira/browse/SPARK-50854). When a relative path is used in Structured Streaming (`DataStreamWriter`), the Parquet files are written to a location relative to the executor (for example `/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/work/app-20250203151257-0000/5/test.parquet/part-00000-54a33cfa-b34c-4e84-8589-cfe763b18ccd-c000.snappy.parquet`) instead of a location relative to the driver, as happens in batch processing.

This PR *does not* address the validity or use cases of relative paths. Their use is likely limited to a single-node cluster where the driver, master, and executors have access to the same local file system, though there may be other setups where the same condition applies.

> I don't think it is very common scenario that people installs Spark in multiple directories and runs driver and executor in separate directory (or any way to set different working directory). Using relative path which resolves to different directories per process doesn't seem like a common scenario and I'd like to see the detail.

The PR does not target installation into different directories. The problem is reproducible on a single-node cluster with a default installation (for example a dev build or `brew install`). The driver is likely to be executed from the root or from its own directory, because the driver code is separate from the Spark code.

> You can explain in other way around - if you see that file source/sink resolves the path in driver in batch query, please describe what's your setup and how you tested and what's the result backing up your claim. It could be used to make a valid claim that we want to have consistence between batch and streaming.

No custom setup was done for Spark. It is a default single-node cluster.

--
This is an automated message from the Apache Git Service.
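The mismatch described in the comment can be illustrated without Spark at all. The sketch below is only a model of the behavior, not Spark code and not the change made in the PR: the two working directories and the helpers `resolve_in` and `qualify_on_driver` are hypothetical names chosen for the illustration. It shows that a relative path resolved independently by each process lands in different places, while a path made absolute once on the driver resolves identically everywhere.

```python
import posixpath


def resolve_in(cwd: str, path: str) -> str:
    """Resolve `path` as a process with working directory `cwd` would."""
    # posixpath.join discards `cwd` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(cwd, path))


def qualify_on_driver(driver_cwd: str, path: str) -> str:
    """Make `path` absolute once, against the driver's working directory."""
    return path if posixpath.isabs(path) else resolve_in(driver_cwd, path)


# Hypothetical working directories (assumptions for the illustration).
driver_cwd = "/home/user/project"
executor_cwd = "/opt/spark/work/app-0000/5"

relative = "test.parquet"

# Each process resolves the relative path on its own: the results differ.
seen_by_driver = resolve_in(driver_cwd, relative)
seen_by_executor = resolve_in(executor_cwd, relative)

# The path is qualified once on the driver, then handed to the executor.
qualified = qualify_on_driver(driver_cwd, relative)
qualified_seen_by_executor = resolve_in(executor_cwd, qualified)

print(seen_by_driver)
print(seen_by_executor)
print(qualified_seen_by_executor)
```

In this model the first two printed paths differ, which corresponds to the streaming behavior reported in SPARK-50854, and the third matches the driver's view, which corresponds to the batch behavior the comment describes.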
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org