Hi
I think this is a performance issue in both DataFrame and Dataset 
caching, not only in the Encoders. The DataFrame version, 
"spark.range(Int.MaxValue).toDF.cache().count()", is also slow.
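For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes a running Spark shell where `spark` (SparkSession) and `sc` (SparkContext) are already defined; the `time` helper is ad hoc, not a Spark API, and requires a live cluster or local Spark to run:

```scala
// Ad-hoc timing helper (not part of Spark's API).
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body          // first action materializes the cache
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

// RDD cache: rows are stored as deserialized JVM objects.
time("RDD")("" + sc.parallelize(0 until Int.MaxValue).cache().count())

// Dataset cache: built through the columnar in-memory format.
time("Dataset")("" + spark.range(Int.MaxValue).cache().count())

// DataFrame cache goes through the same columnar path, so it is
// similarly slow on the first materializing action.
time("DataFrame")("" + spark.range(Int.MaxValue).toDF.cache().count())
```

Note that cache() itself is lazy; the cost shows up on the first count(), which is when the columnar representation is actually built.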

A cache for a DataFrame or Dataset is stored in a columnar format 
with compressed data representations, and we have found that there is 
room to improve its performance. We have already created pull requests 
to address this; they are under review. 
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback on these pull requests.

Best Regards,
Kazuaki Ishizaki



From:   Maciej Bryński <mac...@brynski.pl>
To:     Spark dev list <dev@spark.apache.org>
Date:   2016/08/28 05:40
Subject:        Cache'ing performance



Hi,
I did some benchmark of cache function today.

RDD
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets
spark.range(Int.MaxValue).cache().count()

For me, Datasets were about 2 times slower.

Results (3 nodes, 20 cores and 48 GB RAM each):
RDD - 6 s
Datasets - 13.5 s

Is that expected behavior for Datasets and Encoders?

Regards,
-- 
Maciek Bryński
