DataFrames are a narrower, more specific abstraction for tabular data. When your data is tabular, they usually make more sense, because knowing the schema lets a lot more be optimized under the hood for you, whereas the framework can do nothing with an RDD of arbitrary objects. DataFrames are not simply a "better RDD".
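To make that concrete, here is a rough sketch (the file name, column names, and the Event case class are made up for illustration) of the same filter written declaratively against a DataFrame, where Catalyst can rewrite the plan, versus as opaque lambdas over an RDD:

  import org.apache.spark.sql.SparkSession

  // Hypothetical schema, just for illustration.
  case class Event(userId: Long, country: String)

  val spark = SparkSession.builder().appName("df-vs-rdd").getOrCreate()
  import spark.implicits._

  // DataFrame: the filter and projection are declarative, so Catalyst can
  // push the predicate into the Parquet scan and prune unused columns.
  val df = spark.read.parquet("events.parquet")
  df.filter($"country" === "DE").select("userId").show()

  // RDD of arbitrary objects: the lambdas are opaque to Spark, so every
  // row is fully deserialized into an Event and the functions run as-is.
  val events = df.as[Event].rdd
  events.filter(_.country == "DE").map(_.userId).take(20).foreach(println)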
Datasets are more like the new RDDs: they support more general objects and programmatic access. They are still a different thing, for a different purpose, than DataFrames, but they have an API more similar to DataFrames and get some of the same kinds of benefits for simple types via Encoders. (A rough sketch comparing the two follows the quoted message below.)

On Tue, Nov 22, 2016 at 2:50 PM jggg777 <jonrgr...@gmail.com> wrote:

> I've seen a number of visuals showing the processing time benefits of using
> Datasets+DataFrames over RDDs, but I'd assume that there are performance
> benefits to using a defined case class instead of a generic Dataset[Row]. The
> tale of three Spark APIs post mentions "If you want higher degree of
> type-safety at compile time, want typed JVM objects, *take advantage of
> Catalyst optimization, and benefit from Tungsten’s efficient code
> generation, use Dataset.*"
>
> Are there any comparisons showing the performance differences between
> Datasets and DataFrames? Or more information about how Catalyst/Tungsten
> handle them differently?
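Here is the typed side of the sketch above (again with a made-up input file and a hypothetical Person case class): a generic Dataset[Row] references columns by name and is checked only at runtime, while a Dataset[Person] gets an implicit Encoder that keeps rows in Tungsten's compact binary format and gives typed field access, though lambda-based operations remain opaque to Catalyst:

  import org.apache.spark.sql.{Dataset, SparkSession}
  import org.apache.spark.sql.functions.col

  // Hypothetical record type for illustration (JSON numbers infer as Long).
  case class Person(name: String, age: Long)

  val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
  import spark.implicits._

  // DataFrame (Dataset[Row]): columns referenced by name, checked only at
  // runtime, but the whole plan is visible to Catalyst.
  val df = spark.read.json("people.json")
  df.filter(col("age") > 21).select("name").show()

  // Dataset[Person]: the implicit Encoder[Person] keeps the data in
  // Tungsten's binary format and gives compile-time field access; the
  // typed lambdas below, however, are still opaque to the optimizer.
  val people: Dataset[Person] = df.as[Person]
  people.filter(_.age > 21).map(_.name).show()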