On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> Hello Vibhor,
> Thanks for the suggestion. I am looking for other alternatives where the
> same dataframe can be written to two destinations without re-execution
> and without cache() or persist().
>
> Can someone help me with scenario 2? How can we make Spark write to
> MinIO faster?
>
> Sent from my iPhone
>
> On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:
> >
> > Hi Prem,
> >
> > You can try writing to HDFS, then reading from HDFS and writing to
> > MinIO. This will prevent the duplicate transformation.
> >
> > You can also try persisting the dataframe using the DISK_ONLY level.
> >
> > Regards,
> > Vibhor
> >
> > From: Prem Sahoo <prem.re...@gmail.com>
> > Date: Tuesday, 21 May 2024 at 8:16 AM
> > To: Spark dev list <d...@spark.apache.org>
> > Subject: EXT: Dual Write to HDFS and MinIO in faster way
> >
> > Hello Team,
> >
> > I am planning to write to two datasources at the same time.
> >
> > Scenario 1:
> > Write the same dataframe to HDFS and MinIO without re-executing the
> > transformations and without cache(). How can we make it faster?
> > The job reads a Parquet file, does a few transformations, and writes
> > to HDFS and MinIO. For both writes, Spark needs to execute the
> > transformations again. Is there a way to avoid re-executing the
> > transformations without cache()/persist()?
> >
> > Scenario 2:
> > I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
> > Is there any way to make this write faster?
> > I don't want to repartition before writing, as repartitioning has the
> > overhead of a shuffle.
> >
> > Please provide some inputs.
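For scenario 2, write speed to MinIO usually hinges on the S3A connector configuration rather than on Spark itself. A hedged example of the knobs that commonly matter is below; the endpoint and the values are illustrative assumptions, not recommendations, and should be benchmarked against your cluster. The magic committer avoids the slow rename-based commit on object stores, but verify it is supported by your Hadoop/Spark build before relying on it.

```shell
# Illustrative S3A tuning for a MinIO target (assumed endpoint and values).
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example.com:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.fast.upload.buffer=disk \
  --conf spark.hadoop.fs.s3a.multipart.size=128M \
  --conf spark.hadoop.fs.s3a.threads.max=64 \
  --conf spark.hadoop.fs.s3a.connection.maximum=128 \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  your_job.py
```

Raising the upload thread and connection counts increases parallelism of multipart uploads per executor, which is often the bottleneck when writing a few GB to an object store.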