How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Hello, One of the main "selling points" of Spark is that, unlike Hadoop map-reduce, which persists intermediate results of its computation to HDFS (disk), Spark keeps all its results in memory. I don't understand this, as in reality, when a Spark stage finishes, it writes all of the data into shuffle
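
A minimal sketch of the job shape the question describes, in Scala against the RDD API (the app name and paths are made up for illustration). The narrow steps run pipelined in memory inside one stage; only the wide dependency at reduceByKey closes the stage and writes shuffle files to the executors' local disks, which is the disk IO the question is about.

import org.apache.spark.sql.SparkSession

object ShuffleBoundarySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-boundary-sketch").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///tmp/input")    // hypothetical input path
      .flatMap(_.split("\\s+"))                      // narrow: pipelined in memory
      .map(word => (word, 1L))                       // narrow: still the same stage
      .reduceByKey(_ + _)                            // wide: the stage before this boundary
                                                     // writes shuffle files to local disk

    counts.saveAsTextFile("hdfs:///tmp/output")      // hypothetical output path
    spark.stop()
  }
}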

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
we will have the same number of IO writes as there are stages. thanks, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 3:34 PM, Sid wrote: > Hi Krexos, > > If I understand correctly, you are trying to ask that, even though Spark involves > disk I/O, how it is a

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Don't stages by definition include a shuffle? If you didn't need a shuffle between 2 stages, you could merge them into one stage. thanks, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 4:13 PM, Sean Owen wrote: > Because only shuffle stages write shuffl

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Isn't Spark the same in this regard? You can execute all of the narrow dependencies of a Spark stage in one mapper, thus having the same number of mappers + reducers as Spark stages for the same job, no? thanks, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 4:
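
A rough illustration of that equivalence, assuming a SparkContext sc is in scope (the input path and field layout are invented): every run of narrow dependencies is fused into one stage, much like a single mapper, and each shuffle adds one more stage.

// Two shuffles => three stages; each stage runs its narrow steps fused in memory.
val result = sc.textFile("hdfs:///tmp/events")   // hypothetical input
  .map(_.split(","))                             // narrow \
  .filter(_.length >= 2)                         // narrow  > stage 1, like one mapper
  .map(fields => (fields(0), 1L))                // narrow /
  .reduceByKey(_ + _)                            // shuffle #1 closes stage 1
  .map { case (user, n) => (n, user) }           // narrow: runs inside stage 2
  .sortByKey(ascending = false)                  // shuffle #2 closes stage 2; stage 3 yields the result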

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
which saves about 2 times the IO. Thanks everyone, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 1:35 PM, krexos wrote: > Hello, > > One of the main "selling points" of Spark is that unlike Hadoop map-reduce > that persists intermediate results o
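
Spelled out as a rough count (ignoring spills, map-side combining, and HDFS replication, so only an approximation): for a pipeline that needs K shuffle boundaries,

  chained MapReduce: ~K local writes (map output) + ~K HDFS writes (reduce output) = ~2K writes
  Spark:             ~K local writes (shuffle files) + 1 HDFS write (final result)  = ~K + 1 writes

which is where the "about 2 times" figure comes from.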

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
than MR --- Original Message --- On Saturday, July 2nd, 2022 at 5:27 PM, Sid wrote: > I have explained the same thing in very layman's terms. Go through it once. > > On Sat, 2 Jul 2022, 19:45 krexos, wrote: > >> I think I understand where Spark saves IO. >>

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
pers, but harder to use. Not impossible, but you > could also say Spark just made it easier to do the more efficient thing. > > On Sat, Jul 2, 2022, 9:34 AM krexos wrote: > >> You said Spark performs IO only when reading data and writing final data to >> the disk. I thought by

Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE....)

2023-01-21 Thread krexos
My periodically running process writes data to a table over parquet files with the configuration "spark.sql.sources.partitionOverwriteMode" = "dynamic" with the following code: if (!tableExists) { df.write .mode("overwrite") .partitionBy("partitionCol") .format("parquet"
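
A hedged reconstruction of the write pattern the message appears to describe, since the snippet is truncated (the table name, the else branch, and the config call are assumptions):

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

if (!tableExists) {
  // First run: let the DataFrame define the table, including its partitioning.
  df.write
    .mode("overwrite")
    .partitionBy("partitionCol")
    .format("parquet")
    .saveAsTable("db.my_table")            // hypothetical table name
} else {
  // Later runs: with dynamic partition overwrite, only the partitions
  // present in df are replaced; the table's existing layout is reused.
  df.write
    .mode("overwrite")
    .insertInto("db.my_table")
}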

Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE....)

2023-01-21 Thread krexos
an wrote: > In the case of saveAsTable("tablename") you specified the partition: > 'partitionBy("partitionCol")' > > On Sat, Jan 21, 2023 at 4:03 AM krexos wrote: > >> My periodically running process writes
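
For contrast with the saveAsTable path (which carried the partitionBy spec), a table created up front with spark.sql needs the partitioning declared in the DDL for dynamic partition overwrite to behave the same way. A rough sketch with hypothetical table and column names:

spark.sql("""
  CREATE TABLE IF NOT EXISTS db.my_table (
    id BIGINT,
    value STRING,
    partitionCol STRING
  )
  USING parquet
  PARTITIONED BY (partitionCol)
""")

// With the partitioning declared in the DDL, dynamic partition overwrite
// again replaces only the partitions present in df.
df.write.mode("overwrite").insertInto("db.my_table")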