Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

Jacek Laskowski Wed, 08 Jun 2016 07:50:29 -0700

Hi,

I just noticed it today while toying with Spark 2.0.0 (today's build)
that doing Seq(...).toDF does **not** submit a Spark job while
sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have been
thinking about the reason for the behaviour.


My explanation was that Datasets are just a "view" layer atop data and
when this data is local/in memory already there's no need to submit a
job to...well...compute the data.
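
A minimal sketch for comparing the two cases side by side. It assumes a
local SparkSession on Spark 2.x; the object name ToDFPlans, the column
name "n" and the sample values are made up for illustration. The
expectation that the first plan is built on a LocalRelation node and the
second on an RDD-backed node is an assumption to verify against the
printed output, not a confirmed result.

```scala
import org.apache.spark.sql.SparkSession

object ToDFPlans {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: Spark 2.x on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("toDF-plans")
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Case 1: DataFrame created directly from a local Scala collection.
    val local = Seq(1, 2, 3).toDF("n")

    // Case 2: the same data, first distributed as an RDD.
    val viaRdd = sc.parallelize(Seq(1, 2, 3)).toDF("n")

    // Print the logical plans to see what backs each DataFrame.
    println(local.queryExecution.logical)
    println(viaRdd.queryExecution.logical)

    // Print the parsed, analyzed, optimized and physical plans.
    local.explain(true)
    viaRdd.explain(true)

    spark.stop()
  }
}
```

Running an action such as show() on each DataFrame while watching the
Jobs tab of the web UI is one way to check which of the two submits a job.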

I'd appreciate a more in-depth answer, perhaps with links to the code. Thanks!

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

