Hello,
One of the main "selling points" of Spark is that, unlike Hadoop MapReduce,
which persists intermediate results of its computation to HDFS (disk), Spark
keeps all its results in memory. I don't understand this, as in reality when a
Spark stage finishes it writes all of its data into shuffle files on disk, so
we will have the same amount of IO writes as there are stages.
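For example, as I understand it (a toy job in spark-shell, where spark is the
usual SparkSession and the input path is made up):

    // narrow transformations are pipelined in memory within one stage;
    // the wide reduceByKey ends the stage with a shuffle write to disk
    val counts = spark.sparkContext
      .textFile("hdfs:///input/logs")      // stage 0 reads the input
      .map(_.toLowerCase)                  // narrow: stays in stage 0
      .filter(_.contains("error"))         // narrow: still stage 0
      .map(line => (line, 1))
      .reduceByKey(_ + _)                  // stage 0 ends here: shuffle files are written
      .collect()                           // stage 1 reads those shuffle files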
thanks,
krexos
--- Original Message ---
On Saturday, July 2nd, 2022 at 3:34 PM, Sid wrote:
> Hi Krexos,
>
> If I understand correctly, you are trying to ask: if even Spark involves
> disk I/O, then how is it a memory-based solution?
Don't stages by definition include a shuffle? If you didn't need a shuffle
between 2 stages you could merge them into one stage.
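You can see this merging in the RDD lineage itself, e.g. (a toy example):

    // consecutive narrow transformations are pipelined into a single stage;
    // only reduceByKey introduces a shuffle boundary (a new stage)
    val rdd = spark.sparkContext.parallelize(1 to 100)
      .map(x => (x % 10, x))
      .filter(_._2 > 5)          // narrow: same stage as the map above
      .reduceByKey(_ + _)        // wide: starts a new stage
    println(rdd.toDebugString)   // the indented output marks the shuffle boundary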
thanks,
krexos
--- Original Message ---
On Saturday, July 2nd, 2022 at 4:13 PM, Sean Owen wrote:
> Because only shuffle stages write shuffle data to disk.
Isn't Spark the same in this regard? You can execute all of the narrow
dependencies of a Spark stage in one mapper, thus having the same number of
mappers + reducers as Spark stages for the same job, no?
thanks,
krexos
--- Original Message ---
On Saturday, July 2nd, 2022 at 4:
I think I understand where Spark saves IO. Between two consecutive stages, MR
writes the map output to local disk and then the reduce output to HDFS, while
Spark only writes the shuffle files once, which saves about 2 times the IO
thanks everyone,
krexos
--- Original Message ---
On Saturday, July 2nd, 2022 at 1:35 PM, krexos wrote:
> Hello,
>
> One of the main "selling points" of Spark is that, unlike Hadoop MapReduce,
> which persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory.
than MR
--- Original Message ---
On Saturday, July 2nd, 2022 at 5:27 PM, Sid wrote:
> I have explained the same thing in very layman's terms. Go through it once.
>
> On Sat, 2 Jul 2022, 19:45 krexos, wrote:
>
>> I think I understand where Spark saves IO.
>>
> MR could also chain mappers, but it's harder to use. Not impossible, but you
> could also say Spark just made it easier to do the more efficient thing.
>
> On Sat, Jul 2, 2022, 9:34 AM krexos wrote:
>
>> You said Spark performs IO only when reading data and writing final data to
>> the disk. I thought by
My periodically running process writes data to a table stored as Parquet
files, with the configuration "spark.sql.sources.partitionOverwriteMode" =
"dynamic", using the following code:
if (!tableExists) {
  df.write
    .mode("overwrite")
    .partitionBy("partitionCol")
    .format("parquet")
    .saveAsTable("tablename")
}
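The mode is set like this (a minimal sketch; the insertInto branch for an
already-existing table is my assumption of the usual pattern, not shown above):

    // set once on the session: with "dynamic", an overwrite replaces only
    // the partitions present in df instead of truncating the whole table
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // hypothetical branch for when the table already exists; insertInto
    // uses the table's existing partitioning, so no partitionBy here
    df.write
      .mode("overwrite")
      .insertInto("tablename")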
wrote:
> In the case of saveAsTable("tablename"), you specified the partitioning:
> 'partitionBy("partitionCol")'.
>
> On Sat, Jan 21, 2023 at 4:03 AM krexos wrote:
>
>> My periodically running process writes