DataFrames are a narrower, more specific abstraction for tabular data. When your data is tabular, they usually make more sense, because knowing the schema lets a lot more be optimized under the hood for you, whereas the framework can do nothing with an RDD of arbitrary objects. DataFrames are not simply a "better RDD".
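To make that concrete, here is a rough sketch (the file name, column names, and the Event case class are made up for illustration) of the same filter written declaratively against a DataFrame, where Catalyst can rewrite the plan, versus as opaque lambdas over an RDD:

  import org.apache.spark.sql.SparkSession

  // Hypothetical schema, just for illustration.
  case class Event(userId: Long, country: String)

  val spark = SparkSession.builder().appName("df-vs-rdd").getOrCreate()
  import spark.implicits._

  // DataFrame: the filter and projection are declarative, so Catalyst can
  // push the predicate into the Parquet scan and prune unused columns.
  val df = spark.read.parquet("events.parquet")
  df.filter($"country" === "DE").select("userId").show()

  // RDD of arbitrary objects: the lambdas are opaque to Spark, so every
  // row is fully deserialized into an Event and the functions run as-is.
  val events = df.as[Event].rdd
  events.filter(_.country == "DE").map(_.userId).take(20).foreach(println)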
Datasets are more like the new RDDs: they support more general objects and programmatic access. They are still a different thing, for a different purpose, than DataFrames, but they have an API more similar to DataFrames and get some of the same kinds of benefits for simple types via Encoders. (A rough sketch comparing the two follows the quoted message below.)

On Tue, Nov 22, 2016 at 2:50 PM jggg777 <jonrgr...@gmail.com> wrote:

> I've seen a number of visuals showing the processing time benefits of using
> Datasets+DataFrames over RDDs, but I'd assume that there are performance
> benefits to using a defined case class instead of a generic Dataset[Row]. The
> tale of three Spark APIs post mentions "If you want higher degree of
> type-safety at compile time, want typed JVM objects, *take advantage of
> Catalyst optimization, and benefit from Tungsten’s efficient code
> generation, use Dataset.*"
>
> Are there any comparisons showing the performance differences between
> Datasets and DataFrames? Or more information about how Catalyst/Tungsten
> handle them differently?
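Here is the typed side of the sketch above (again with a made-up input file and a hypothetical Person case class): a generic Dataset[Row] references columns by name and is checked only at runtime, while a Dataset[Person] gets an implicit Encoder that keeps rows in Tungsten's compact binary format and gives typed field access, though lambda-based operations remain opaque to Catalyst:

  import org.apache.spark.sql.{Dataset, SparkSession}
  import org.apache.spark.sql.functions.col

  // Hypothetical record type for illustration (JSON numbers infer as Long).
  case class Person(name: String, age: Long)

  val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
  import spark.implicits._

  // DataFrame (Dataset[Row]): columns referenced by name, checked only at
  // runtime, but the whole plan is visible to Catalyst.
  val df = spark.read.json("people.json")
  df.filter(col("age") > 21).select("name").show()

  // Dataset[Person]: the implicit Encoder[Person] keeps the data in
  // Tungsten's binary format and gives compile-time field access; the
  // typed lambdas below, however, are still opaque to the optimizer.
  val people: Dataset[Person] = df.as[Person]
  people.filter(_.age > 21).map(_.name).show()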