The performance difference comes from the need to serialize and
deserialize the data to and from AnnotationText objects. The extra stage is
probably very quick and shouldn't have much impact.
If you try caching the RDD in serialized mode, it should slow down a lot
too.
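
To check this yourself, here is a rough sketch of the comparison I have in
mind. It is not your code: the fields of AnnotationText, the record count and
the filter are all made up for illustration, and the timings it prints will
depend on your data and cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-in: the real fields of AnnotationText are not shown in the thread.
case class AnnotationText(docId: String, start: Long, end: Long, text: String)

object SerializedCacheSketch {
  // Crude wall-clock timer, good enough for a rough comparison.
  def time[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body
    println(f"$label%-24s ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialized-cache-sketch")
      .master("local[*]") // assumption: local run, adjust for a cluster
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val records = (1 to 1000000).map { i =>
      AnnotationText(s"doc$i", i.toLong, i.toLong + 10L, "sample")
    }

    // RDD cached as live JVM objects: no deserialization on read.
    val rddObjects = spark.sparkContext.parallelize(records)
      .persist(StorageLevel.MEMORY_ONLY)
    // RDD cached as serialized bytes: every read deserializes each record.
    val rddSerialized = spark.sparkContext.parallelize(records)
      .persist(StorageLevel.MEMORY_ONLY_SER)
    // Dataset cache: rows are held in Spark's internal binary format, so a
    // typed lambda has to turn each row back into an AnnotationText.
    val ds = records.toDS().cache()

    // Materialize all three caches before timing anything.
    rddObjects.count()
    rddSerialized.count()
    ds.count()

    time("RDD (MEMORY_ONLY)")(rddObjects.filter(_.text.nonEmpty).count())
    time("RDD (MEMORY_ONLY_SER)")(rddSerialized.filter(_.text.nonEmpty).count())
    time("Dataset (typed filter)")(ds.filter((a: AnnotationText) => a.text.nonEmpty).count())

    spark.stop()
  }
}
```

If the serialized RDD lands closer to the Dataset than to the plain RDD, that
points at the serialization cost rather than at anything else in the Dataset
path.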
On Thu, Jul 28, 2016 at 9:52 AM, Darin McBeath wrote:
I started playing around with Datasets on Spark 2.0 this morning and I'm
surprised by the significant performance difference I'm seeing between an RDD
and a Dataset for a very basic example.
I've defined a simple case class called AnnotationText that has a handful of
fields.
I create a Datase