I agree with the previous answers that (if requirements allow it) it is much easier to just orchestrate a copy, either in the same app or by syncing externally.
A long time ago, and not for a Spark app, we solved a similar use case via https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-hdfs/ViewFs.html#Multi-Filesystem_I.2F0_with_Nfly_Mount_Points . It may work with Spark because it is implemented beneath the FileSystem API.

On Tue, May 21, 2024 at 10:03 PM Prem Sahoo <[email protected]> wrote:

> I am looking for a writer/committer optimization which can make the Spark
> write faster.
>
> On Tue, May 21, 2024 at 9:15 PM [email protected] <[email protected]> wrote:
>
>> Hi,
>> I think you should write to HDFS, then copy the file (Parquet or ORC)
>> from HDFS to MinIO.
>>
>> ------------------------------
>> eabour
>>
>> *From:* Prem Sahoo <[email protected]>
>> *Date:* 2024-05-22 00:38
>> *To:* Vibhor Gupta <[email protected]>; user <[email protected]>
>> *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way
>>
>> On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <[email protected]> wrote:
>>
>>> Hello Vibhor,
>>> Thanks for the suggestion.
>>> I am looking for some other alternatives where the same dataframe can be
>>> written to two destinations without re-execution and without cache() or
>>> persist().
>>>
>>> Can someone help me with scenario 2?
>>> How can I make Spark write to MinIO faster?
>>>
>>> On May 21, 2024, at 1:18 AM, Vibhor Gupta <[email protected]> wrote:
>>>
>>> Hi Prem,
>>>
>>> You can try writing to HDFS, then reading from HDFS and writing to
>>> MinIO. This will prevent the duplicate transformation.
>>>
>>> You can also try persisting the dataframe using the DISK_ONLY level.
>>>
>>> Regards,
>>> Vibhor
>>>
>>> *From:* Prem Sahoo <[email protected]>
>>> *Date:* Tuesday, 21 May 2024 at 8:16 AM
>>> *To:* Spark dev list <[email protected]>
>>> *Subject:* EXT: Dual Write to HDFS and MinIO in faster way
>>>
>>> Hello Team,
>>>
>>> I am planning to write to two data sources at the same time.
>>>
>>> Scenario 1:
>>> Write the same dataframe to HDFS and MinIO without re-executing the
>>> transformations and without cache(). How can we make it faster?
>>>
>>> Read a Parquet file, do a few transformations, and write to HDFS and
>>> MinIO. Here Spark needs to execute the transformations again for each
>>> write. Do we know how we can avoid re-executing the transformations
>>> without cache()/persist()?
>>>
>>> Scenario 2:
>>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>>> Do we have any way to make this write faster?
>>>
>>> I don't want to repartition before writing, as repartitioning has the
>>> overhead of a shuffle.
>>>
>>> Please provide some inputs.
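For reference, an Nfly mount point of the kind linked above is configured roughly like this. This is only a sketch adapted from the ViewFs docs: the mount-table name `global`, the paths, and the URIs are placeholders, and the exact property syntax and Nfly options (replication, repair-on-read, etc.) should be checked against the documentation:

```xml
<!-- Sketch: an Nfly mount point that fans writes out to two filesystems.
     A client writing under viewfs://global/data would write to both URIs.
     All names and URIs below are placeholders. -->
<property>
  <name>fs.viewfs.mounttable.global.linkNfly../data</name>
  <value>hdfs://namenode:8020/data,s3a://mybucket/data</value>
</property>
```

Since this sits below the FileSystem API, the Spark job itself would not need to change; it just writes to the viewfs:// path.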
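eabour's write-then-copy suggestion is usually orchestrated with DistCp over the S3A connector. A sketch, assuming an S3A-compatible MinIO endpoint; the host, bucket, and paths are placeholders, and credential properties are omitted:

```shell
# Step 1: the Spark job writes Parquet/ORC once, to HDFS only.
# Step 2: mirror the finished output to MinIO via DistCp.
hadoop distcp \
  -D fs.s3a.endpoint=http://minio.example.com:9000 \
  -D fs.s3a.path.style.access=true \
  hdfs://namenode:8020/warehouse/my_table \
  s3a://mybucket/warehouse/my_table
```

This keeps the Spark write path fast (a single HDFS write) and moves the MinIO transfer to a separate, parallel copy job.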
