Taking it to a more basic level, I compared a simple transformation
with RDDs and with Datasets. This is far simpler than Renato's use case, and
it brings up two good questions:
1. Is the time it takes to "spin up" a standalone instance of Spark (SQL)
just an additional one-time overhead? That would be reasonable, especially for
the first version of Datasets.
2. Are Datasets, in some cases, slower than RDDs? If so, in which cases, and why?
*Datasets code*: ~2000 msec
SQLContext sqc = createSQLContext(createContext());
sqc.createDataset(WORDS, Encoders.STRING())
    .map(new MapFunction<String, String>() {
      @Override
      public String call(String value) throws Exception {
        return value.toUpperCase();
      }
    }, Encoders.STRING())
    .show();
*RDDs code*: < 500 msec
JavaSparkContext jsc = createContext();
List<String> res = jsc.parallelize(WORDS)
    .map(new Function<String, String>() {
      @Override
      public String call(String v1) throws Exception {
        return v1.toUpperCase();
      }
    })
    .collect();
*Those are the context creation functions:*
static SQLContext createSQLContext(JavaSparkContext jsc) {
  return new SQLContext(jsc);
}
static JavaSparkContext createContext() {
  return new JavaSparkContext(new SparkConf()
      .setMaster("local[*]")
      .setAppName("WordCount")
      .set("spark.ui.enabled", "false"));
}
*And the input:*
List<String> WORDS = Arrays.asList("hi there", "hi", "hi sue bob", "hi sue", "bob hi");
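One way to tell one-time start-up cost (question 1) apart from steady-state cost (question 2) would be to run the same job several times on the same context and compare the first run against the rest. Below is a minimal sketch of such a harness using only the JDK. `WarmupTiming` and `timeRuns` are hypothetical names and not part of Spark, and the job in `main` is just a stand-in for the Datasets or RDDs code above.

```java
// Hypothetical helper, not part of Spark: times consecutive runs of one job.
class WarmupTiming {

    /** Runs the job `runs` times; returns the wall-clock time of each run in msec. */
    static long[] timeRuns(Runnable job, int runs) {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            job.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Stand-in job (assumption). In the real test this would be the
        // Datasets or RDDs pipeline, run against an already created context.
        Runnable job = () -> "hi sue bob".toUpperCase();
        long[] millis = timeRuns(job, 5);
        for (int i = 0; i < millis.length; i++) {
            System.out.println("run " + i + ": " + millis[i] + " msec");
        }
    }
}
```

If only run 0 is slow, the difference is most likely one-time initialization. If every run is slow, it points at a per-job cost instead.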
On Thu, May 12, 2016 at 12:04 PM Renato Marroquín Mogrovejo <
[email protected]> wrote:
> Hi Amit,
>
> This is very interesting indeed because I have got similar results. I
> tried doing a filter + groupBy using a Dataset with a function, and using
> the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a DataFrame
> because apparently there is no straightforward way to create an RDD of
> Parquet data without creating a SQLContext. If anybody has some code to
> share with me, please share (:
> I used 1GB of Parquet data, and the operations were much faster with the
> RDD. After looking at the execution plans, it is clear why
> Datasets do worse. Using them requires an extra map operation to map Row
> objects into the defined case class. Then the Dataset goes through the whole
> query optimization platform (Catalyst, plus moving objects in and out of
> Tungsten).
> Thus, I think for operations that are too "simple", it is more expensive to
> use the entire DS/DF infrastructure than the inner RDD.
> IMHO if you have complex SQL queries, it makes sense to use DS/DF, but if
> you don't, then using RDDs directly is probably still faster.
>
>
> Renato M.
>
> 2016-05-11 20:17 GMT+02:00 Amit Sela <[email protected]>:
>
>> Somehow missed that ;)
>> Anything about Datasets slowness ?
>>
>> On Wed, May 11, 2016, 21:02 Ted Yu <[email protected]> wrote:
>>
>>> Which release are you using ?
>>>
>>> You can use the following to disable UI:
>>> --conf spark.ui.enabled=false
>>>
>>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela <[email protected]>
>>> wrote:
>>>
>>>> I've run a simple WordCount example with a very small List<String> as
>>>> input lines in standalone mode (local[*]), and Datasets is very
>>>> slow.
>>>> We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec.
>>>> Is this just start-up overhead? Please note that I'm not timing the
>>>> context creation...
>>>>
>>>> And in general, is there a way to run local[*] in a "lightweight" mode
>>>> for testing? Something like running without the WebUI server, for example
>>>> (and anything else that's not needed for testing purposes).
>>>>
>>>> Thanks,
>>>> Amit
>>>>
>>>
>>>
>