Re: RDD vs Dataset performance

2016-07-28 Thread Reynold Xin
The performance difference comes from the need to serialize and deserialize data to AnnotationText. The extra stage is probably very quick and shouldn't have much impact. If you try caching the RDD in serialized mode, it will slow down a lot too. On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath
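
For readers who want to reproduce the comparison, here is a minimal sketch of the three cases discussed in this reply: a plain RDD, a Dataset with a typed lambda, and an RDD cached in serialized form. The fields of AnnotationText, the row count, the filter predicate, and the object name are all assumptions made for illustration; the original post only says the case class has "a handful of fields".

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical fields: the thread does not list the real ones.
case class AnnotationText(docId: String, text: String, start: Int, end: Int)

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides the Encoder for AnnotationText

    // Synthetic data; size chosen arbitrarily for the sketch.
    val rows = (1 to 1000000).map(i => AnnotationText(s"doc$i", s"text $i", i, i + 10))

    // Case 1: plain RDD. The lambda runs directly on JVM objects.
    val rdd = spark.sparkContext.parallelize(rows)
    val rddCount = rdd.filter(a => a.text.endsWith("0")).count()

    // Case 2: Dataset with a typed lambda. Rows are held in Spark's
    // internal binary format, so each one is deserialized into an
    // AnnotationText before the lambda can run.
    val ds = spark.createDataset(rows)
    val dsCount = ds.filter((a: AnnotationText) => a.text.endsWith("0")).count()

    // Case 3: RDD cached in serialized mode. Reading the cache also
    // pays a deserialization cost per record.
    val serRdd = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    serRdd.count() // materialize the serialized cache
    val serCount = serRdd.filter(a => a.text.endsWith("0")).count()

    println(s"rdd=$rddCount dataset=$dsCount serializedRdd=$serCount")
    spark.stop()
  }
}
```

All three counts should agree; only the elapsed time differs. Timing each case (for example with System.nanoTime around the count calls) would show where the serialization overhead appears, though absolute numbers depend on the data and cluster.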

RDD vs Dataset performance

2016-07-28 Thread Darin McBeath
I started playing around with Datasets on Spark 2.0 this morning and I'm surprised by the significant performance difference I'm seeing between an RDD and a Dataset for a very basic example. I've defined a simple case class called AnnotationText that has a handful of fields. I create a Datase