Makes sense. Thanks, Michael (and welcome back from #SparkSummit!). On to exploring the space...
Jacek

On 9 Jun 2016 6:10 p.m., "Michael Armbrust" <mich...@databricks.com> wrote:

> Look at explain(). For a Seq we know it's just local data, so we avoid
> Spark jobs for simple operations. In contrast, an RDD is opaque to
> Catalyst, so we can't perform that optimization.
>
> On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I just noticed today, while toying with Spark 2.0.0 (today's build),
>> that Seq(...).toDF does **not** submit a Spark job, while
>> sc.parallelize(Seq(...)).toDF does. I was pleasantly surprised and have
>> been thinking about the reason for the behaviour.
>>
>> My explanation was that Datasets are just a "view" layer atop data, and
>> when this data is already local/in memory there's no need to submit a
>> job to...well...compute the data.
>>
>> I'd appreciate a more in-depth answer, perhaps with links to the code.
>> Thanks!
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
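P.S. For the archives, a minimal sketch of how explain() surfaces the
difference (assuming a Spark 2.0 spark-shell where spark and sc are
predefined; the plan output in the comments is abbreviated, and the exact
operator names may vary across builds):

import spark.implicits._

// Local data: Catalyst sees a LocalRelation, so simple operations
// can be answered without submitting a Spark job.
Seq(1, 2, 3).toDF("n").explain()
// == Physical Plan ==
// LocalTableScan [n#...]

// RDD-backed data: the plan bottoms out in a scan over the RDD, which
// Catalyst cannot look inside, so materializing the rows needs a job.
sc.parallelize(Seq(1, 2, 3)).toDF("n").explain()
// == Physical Plan ==
// ... +- Scan ExternalRDDScan[obj#...]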