Makes sense. Thanks, Michael (and welcome back from #SparkSummit!). On to exploring the space...
Jacek

On 9 Jun 2016 6:10 p.m., "Michael Armbrust" <mich...@databricks.com> wrote:

> Look at explain(). For a Seq we know it's just local data, so we avoid
> Spark jobs for simple operations. In contrast, an RDD is opaque to
> Catalyst, so we can't perform that optimization.
>
> On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I just noticed today, while toying with Spark 2.0.0 (today's build),
>> that Seq(...).toDF does **not** submit a Spark job, while
>> sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have
>> been thinking about the reason for the behaviour.
>>
>> My explanation was that Datasets are just a "view" layer atop data, and
>> when this data is already local/in memory there's no need to submit a
>> job to...well...compute the data.
>>
>> I'd appreciate a more in-depth answer, perhaps with links to the code.
>> Thanks!
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
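P.S. For the archives, a minimal sketch of how explain() surfaces the
difference (assuming a Spark 2.0 spark-shell where spark and sc are
predefined; the plan output in the comments is abbreviated, and the exact
operator names may vary across builds):

import spark.implicits._

// Local data: Catalyst sees a LocalRelation, so simple operations
// can be answered without submitting a Spark job.
Seq(1, 2, 3).toDF("n").explain()
// == Physical Plan ==
// LocalTableScan [n#...]

// RDD-backed data: the plan bottoms out in a scan over the RDD, which
// Catalyst cannot look inside, so materializing the rows needs a job.
sc.parallelize(Seq(1, 2, 3)).toDF("n").explain()
// == Physical Plan ==
// ... +- Scan ExternalRDDScan[obj#...]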