Please review https://github.com/apache/spark/pull/49654
> Your best bet is to make relative path in driver to be resolved to absolute
> path and pass over to executor with that resolved path.

Right, this is exactly what I was going to implement and how it is done for DataFrameWriter in the DataSource.

> Examples are assuming you are running them in the single node cluster. If you
> feel like it's causing confusion, this is something we need to fix, e.g. put
> disclaimer that the example is based on the assumption it is running with a
> single node cluster.

IMO, it would be better to state any assumption explicitly. Note that it does not only assume a single node cluster. The assumption is that the examples are also run on the same node as the cluster. For example, running a single node cluster in a Docker container and the examples on the host may not work properly (depending on how the Docker container FS is set up).

> NFS is not a local filesystem from my comment. What I'm really saying is,
> every node must see the same location when you specify the path. You can't
> let each node write to different physical paths and claim it is working. It
> doesn't and it is NOT a spec. It's not a bug, sorry.

IMO we are on the same page here. For distributed compute it is necessary to use a distributed FS so that every compute node has access to the same logical path. I was referring to https://issues.apache.org/jira/browse/SPARK-50854, which applies to a single node cluster.

Thank you,

Vlad

On Jan 16, 2025, at 8:57 PM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Your best bet is to make relative path in driver to be resolved to absolute path and pass over to executor with that resolved path. This needs some discussion whether we want to do that, but this is at least technically correct.

On Fri, Jan 17, 2025 at 1:54 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Examples are assuming you are running them in the single node cluster.
If you feel like it's causing confusion, this is something we need to fix, e.g. put a disclaimer that the example is based on the assumption it is running with a single node cluster.

>> > More problematic thing is to use the local filesystem for the path which
>> > is interpreted by distributed machines.
>
> It depends. Nowadays distributed systems mostly use cloud (S3, GFS, etc) or
> HDFS, but NFS and other locally mounted FS can still be in use and should be
> supported.

NFS is not a local filesystem from my comment. What I'm really saying is, every node must see the same location when you specify the path. You can't let each node write to different physical paths and claim it is working. It doesn't, and it is NOT a spec. It's not a bug, sorry.

On Fri, Jan 17, 2025 at 4:41 AM Rozov, Vlad <vro...@amazon.com.invalid> wrote:

> More problematic thing is to use the local filesystem for the path which is
> interpreted by distributed machines.

It depends. Nowadays distributed systems mostly use cloud storage (S3, GFS, etc.) or HDFS, but NFS and other locally mounted filesystems can still be in use and should be supported.

> this actually requires people to mostly use absolute paths (including scheme
> or not).

There is no validation in the API that an absolute path (with or without a scheme) is used, and the examples in the doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) use a relative path.

> we are not expecting metadata directory and the actual files to be placed in
> physically different locations;

Sounds like a bug to me, so I will file a JIRA and fix it.

Thank you,

Vlad

On Jan 15, 2025, at 8:45 PM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> I do understand that using relative path is not the best option especially in
> the distributed systems

More problematic thing is to use the local filesystem for the path which is interpreted by distributed machines.
Yes, using relative paths is also problematic since it depends on the working directory and there is no guarantee about it (it really depends on the setup of the cluster). But we are not expecting the metadata directory and the actual files to be placed in physically different locations; this actually requires people to mostly use absolute paths (with or without a scheme).

On Thu, Jan 16, 2025 at 3:16 AM Rozov, Vlad <vro...@amazon.com.invalid> wrote:

Resending...

> On Jan 9, 2025, at 1:57 PM, Rozov, Vlad <vro...@amazon.com.INVALID> wrote:
>
> Hi,
>
> I see a difference in how "path" is handled in DataFrameWriter.save(path) and
> DataStreamWriter.start(path) while using a relative path (for example
> "test.parquet") to write parquet files (possibly applies to other file
> formats as well). In the case of DataFrameWriter, the path is relative to the
> current working directory (of the driver), and this is what I would expect it
> to be. In the case of DataStreamWriter, only _spark_metadata is written to
> the directory relative to the current working directory of the driver, and
> the parquet files are written to a directory that is relative to the executor
> directory. Is this a bug caused by the relative path being passed to an
> executor as is, or is the behavior by design? In the latter case, what is the
> rationale?
>
> I do understand that using a relative path is not the best option, especially
> in distributed systems, but I think a relative path is still commonly used
> for testing and prototyping (and in examples).
>
> Thank you,
>
> Vlad
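[Editor's note] The fix discussed upthread — resolve a relative path on the driver to an absolute path before passing it to executors — can be illustrated with a small, Spark-free sketch in Python. The `resolve_on_driver` helper and the sample working directories below are hypothetical names for illustration only; they are not part of the Spark API:

```python
import os

def resolve_on_driver(path: str, driver_cwd: str) -> str:
    """Resolve a possibly-relative path against the driver's working
    directory, so every executor receives the same absolute location."""
    if os.path.isabs(path):
        return path
    return os.path.abspath(os.path.join(driver_cwd, path))

# Hypothetical working directories for the driver and one executor.
driver_cwd = "/home/user/project"
executor_cwd = "/var/spark/work/app-123/0"

path = "test.parquet"  # relative path, as in the documentation examples

# Without resolution, each process interprets the path against its own CWD,
# which reproduces the _spark_metadata / data-file split described above:
assert os.path.join(driver_cwd, path) != os.path.join(executor_cwd, path)

# Resolving once on the driver yields a single location for every node:
resolved = resolve_on_driver(path, driver_cwd)
assert resolved == "/home/user/project/test.parquet"

# An already-absolute path passes through unchanged:
assert resolve_on_driver("/data/out.parquet", driver_cwd) == "/data/out.parquet"
```

This mirrors what the thread says DataFrameWriter already does in the DataSource: the path is anchored to the driver's working directory exactly once, so the distributed-FS requirement (every node must see the same logical path) is unaffected.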