Hi Amit,

This is very interesting indeed, because I have got similar results. I tried doing a filter + groupBy using a Dataset with a function, and using the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a DataFrame because apparently there is no straightforward way to create an RDD of Parquet data without creating a sqlContext. If anybody has some code for that, please share (:

I used 1 GB of Parquet data, and the operations were much faster on the RDD. After looking at the execution plans, it is clear why Datasets do worse. To use them, an extra map operation is needed to convert Row objects into the defined case class. The Dataset then goes through the whole query optimization stack (Catalyst, plus moving objects in and out of Tungsten).

Thus, I think that for operations that are too "simple", it is more expensive to use the entire DS/DF infrastructure than the inner RDD. IMHO, if you have complex SQL queries it makes sense to use DS/DF, but if you don't, then using RDDs directly is probably still faster.
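For reference, here is a rough sketch of the kind of comparison I mean, not the exact code I ran. It assumes the Spark 2.x API (SparkSession, groupByKey); on 1.6 the entry point would be a sqlContext and the typed grouping is spelled differently. The Event case class, its column names, the input path argument and the time helper are all made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical schema: the Parquet data is assumed to have a string
// column "key" and a long column "value".
case class Event(key: String, value: Long)

object DatasetVsRdd {
  // Naive wall-clock timer, good enough for a rough comparison only.
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-vs-rdd")
      .config("spark.ui.enabled", "false") // the setting Ted mentioned
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet(args(0)) // path to the Parquet data

    // Typed Dataset path: rows are deserialized into Event before the
    // lambdas run, which shows up as extra steps in the plan.
    val dsCounts = df.as[Event]
      .filter((e: Event) => e.value > 0)
      .groupByKey((e: Event) => e.key)
      .count()
    dsCounts.explain(true)

    // Inner RDD path: works on RDD[Row] directly, no case class mapping.
    val rddCounts = df.rdd
      .filter((r: Row) => r.getAs[Long]("value") > 0)
      .groupBy((r: Row) => r.getAs[String]("key"))
      .mapValues(_.size.toLong)
    println(rddCounts.toDebugString)

    time("Dataset")(dsCounts.collect())
    time("RDD")(rddCounts.collect())

    spark.stop()
  }
}
```

The explain(true) output and toDebugString are where the extra mapping step for the Dataset becomes visible. The timings from a single run like this are only indicative, since the first action also pays for JIT warm-up and code generation.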
Renato M.

2016-05-11 20:17 GMT+02:00 Amit Sela <[email protected]>:

> Some how missed that ;)
> Anything about Datasets slowness ?
>
> On Wed, May 11, 2016, 21:02 Ted Yu <[email protected]> wrote:
>
>> Which release are you using ?
>>
>> You can use the following to disable UI:
>> --conf spark.ui.enabled=false
>>
>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela <[email protected]> wrote:
>>
>>> I've ran a simple WordCount example with a very small List<String> as
>>> input lines and ran it in standalone (local[*]), and Datasets is very slow..
>>> We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec.
>>> Is this just start-up overhead ? please note that I'm not timing the
>>> context creation...
>>>
>>> And in general, is there a way to run with local[*] "lightweight" mode
>>> for testing ? something like without the WebUI server for example (and
>>> anything else that's not needed for testing purposes)
>>>
>>> Thanks,
>>> Amit
>>>
>>
>>
