The examples assume you are running them on a single-node cluster. If you feel this is causing confusion, it is something we need to fix, e.g. by adding a disclaimer that the example assumes it is running on a single-node cluster.
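To make the assumption concrete, here is a minimal sketch of that kind of example (the app name and the path "test.parquet" are illustrative, not taken from the docs), assuming a local[*] session where the driver and the executors share one machine:

    import org.apache.spark.sql.SparkSession

    // Assumed single-node local[*] session: "test.parquet" resolves against the
    // driver's current working directory, which is only unambiguous because every
    // task runs on the same machine and sees the same filesystem.
    val spark = SparkSession.builder()
      .appName("single-node-example")
      .master("local[*]")
      .getOrCreate()

    spark.range(10).write.mode("overwrite").parquet("test.parquet")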
>> > More problematic thing is to use the local filesystem for the path which is interpreted by distributed machines.
> It depends. Nowadays distributed systems mostly use cloud (S3, GFS, etc) or HDFS, but NFS and other locally mounted FS can still be in use and should be supported.

NFS is not a local filesystem in the sense of my comment. What I'm really saying is: every node must see the same location when you specify the path. You can't let each node write to a different physical path and claim it is working. It doesn't work, and that is NOT the spec. It's not a bug, sorry.

On Fri, Jan 17, 2025 at 4:41 AM Rozov, Vlad <vro...@amazon.com.invalid> wrote:

> > More problematic thing is to use the local filesystem for the path which is interpreted by distributed machines.
>
> It depends. Nowadays distributed systems mostly use cloud (S3, GFS, etc) or HDFS, but NFS and other locally mounted FS can still be in use and should be supported.
>
> > this actually requires people to mostly use absolute paths (including scheme or not).
>
> There is no validation that an absolute path (with or without scheme) is used in the API, and examples in the doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) use relative paths.
>
> > we are not expecting metadata directory and the actual files to be placed in physically different locations;
>
> Sounds like a bug to me, so I will file a JIRA and fix it.
>
> Thank you,
>
> Vlad
>
> On Jan 15, 2025, at 8:45 PM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> > I do understand that using relative path is not the best option especially in the distributed systems
>
> More problematic thing is to use the local filesystem for the path which is interpreted by distributed machines. Yes, using relative paths is also problematic since it depends on the working directory and there is no guarantee with it (it really depends on the setup of the cluster). But we are not expecting metadata directory and the actual files to be placed in physically different locations; this actually requires people to mostly use absolute paths (including scheme or not).
>
> On Thu, Jan 16, 2025 at 3:16 AM Rozov, Vlad <vro...@amazon.com.invalid> wrote:
>
>> Resending...
>>
>> > On Jan 9, 2025, at 1:57 PM, Rozov, Vlad <vro...@amazon.com.INVALID> wrote:
>> >
>> > Hi,
>> >
>> > I see a difference in how "path" is handled in DataFrameWriter.save(path) and DataStreamWriter.start(path) when using a relative path (for example "test.parquet") to write parquet files (possibly applies to other file formats as well). In the case of DataFrameWriter the path is relative to the current working directory (of the driver), and this is what I would expect it to be. In the case of DataStreamWriter only _spark_metadata is written to the directory relative to the current working directory of the driver, while the parquet files are written to a directory that is relative to the executor directory. Is this a bug caused by the relative path being passed to an executor as is, or is the behavior by design? In the latter case, what is the rationale?
>> >
>> > I do understand that using relative path is not the best option especially in the distributed systems, but I think that relative path is still commonly used for testing and prototyping (and in examples).
>> >
>> > Thank you,
>> >
>> > Vlad
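For reference, a minimal sketch of the scenario being discussed, assuming a spark-shell session; the paths "test.parquet", "stream.parquet", and "checkpoint" are illustrative, not taken from the thread. It hands the same relative-path style to DataFrameWriter and DataStreamWriter, which is where the reported difference shows up on a multi-node cluster:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("relative-path-repro").getOrCreate()

    // Batch write: the relative path resolves against the driver's current
    // working directory.
    spark.range(10).write.mode("overwrite").parquet("test.parquet")

    // Streaming write: per the report above, _spark_metadata is created relative
    // to the driver, while the data files are produced by executor tasks; with a
    // relative, non-shared path those locations can differ across machines.
    val query = spark.readStream
      .format("rate")
      .load()
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "checkpoint") // relative path, same caveat
      .start("stream.parquet")

    // The pattern recommended in the thread: an absolute path on storage that
    // every node resolves identically, e.g. "s3://bucket/prefix/out",
    // "hdfs:///tmp/out", or a commonly mounted "file:///mnt/nfs/out".
    // query.stop()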