vrozov commented on PR #49654:
URL: https://github.com/apache/spark/pull/49654#issuecomment-2632483929

   > I'd like to make this be super clear what scenario(s) make us struggle 
without this fix and how this fix will help resolving it.
   
   Please see [SPARK-50854](https://issues.apache.org/jira/browse/SPARK-50854). 
When a relative path is used in structured streaming (`DataStreamWriter`), the 
Parquet files are written to a location relative to the executor's working 
directory (for example 
`/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/work/app-20250203151257-0000/5/test.parquet/part-00000-54a33cfa-b34c-4e84-8589-cfe763b18ccd-c000.snappy.parquet`)
 instead of a location relative to the driver, as is the case in batch 
processing.
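   For illustration only (this is plain Python, not Spark's actual resolution code, and the directory names are hypothetical): a relative path resolves against the current working directory of whichever process opens it, so a driver and an executor started from different directories end up writing the same relative sink path to different absolute locations.

```python
import os
import tempfile

# Hypothetical working directories for a driver and an executor process.
driver_cwd = tempfile.mkdtemp(prefix="driver-")
executor_cwd = tempfile.mkdtemp(prefix="executor-")

# The same relative sink path the user passes to DataStreamWriter.
relative_sink = "test.parquet"

def resolve(path, cwd):
    """Resolve a path the way a process whose working directory is `cwd` would:
    absolute paths pass through, relative paths attach to the process CWD."""
    return path if os.path.isabs(path) else os.path.join(cwd, path)

driver_view = resolve(relative_sink, driver_cwd)
executor_view = resolve(relative_sink, executor_cwd)

# Same user-supplied path, two different physical locations.
assert driver_view != executor_view
print(driver_view)
print(executor_view)
```

   The fix discussed in this PR amounts to resolving the path once on the driver side, so executors receive an already-absolute location.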
   
   This PR *does not* address the validity or the use case of relative paths 
as such. That usage is likely limited to single-node clusters where the driver, 
master, and executors all have access to the same local file system, though 
there may be other setups where the same condition holds.
   
   >I don't think it is very common scenario that people installs Spark in 
multiple directories and runs driver and executor in separate directory (or any 
way to set different working directory). Using relative path which resolves to 
different directories per process doesn't seem like a common scenario and I'd 
like to see the detail.
   
   The PR does not target installations into different directories. The problem 
is reproducible on a single-node cluster with a default installation (for 
example, a dev build or `brew install`). The driver is likely to be executed 
from the root or from its own directory, since the driver code is separate 
from the Spark code.
   
   > You can explain in other way around - if you see that file source/sink 
resolves the path in driver in batch query, please describe what's your setup 
and how you tested and what's the result backing up your claim. It could be 
used to make a valid claim that we want to have consistence between batch and 
streaming.
   
   There is no custom Spark setup involved; it is a default single-node 
cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

