Hi,

I think it is a performance issue in both the DataFrame and Dataset cache; it is not due only to Encoders. The DataFrame version "spark.range(Int.MaxValue).toDF.cache().count()" is also slow.
While a cache for a DataFrame or Dataset is stored in a columnar format with a compressed data representation, we have found that there is room to improve its performance. We have already created pull requests to address this, which are under review:

https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback on these pull requests.

Best Regards,
Kazuaki Ishizaki

From: Maciej Bryński <mac...@brynski.pl>
To: Spark dev list <dev@spark.apache.org>
Date: 2016/08/28 05:40
Subject: Cache'ing performance

Hi,

I did some benchmarking of the cache function today.

RDD:
sc.parallelize(0 until Int.MaxValue).cache().count()

Dataset:
spark.range(Int.MaxValue).cache().count()

For me the Dataset version was 2 times slower.

Results (3 nodes, 20 cores and 48 GB RAM each):
RDD - 6 s
Dataset - 13.5 s

Is that expected behavior for Datasets and Encoders?

Regards,
--
Maciek Bryński
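For readers who want to reproduce such a comparison, a minimal wall-clock timing helper can make the measurement explicit. This is only a sketch: `BenchTimer` and its `time` method are illustrative names not taken from the thread, and the usage example below times a plain Scala collection so it runs without Spark; in a spark-shell one would instead wrap calls such as `spark.range(n).cache().count()`.

```scala
// Hypothetical helper (not from the original thread): runs a block of
// code once and reports its wall-clock time in seconds.
object BenchTimer {
  // Evaluate `body` and return (result, elapsed seconds).
  def time[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body              // by-name parameter: evaluated here
    val elapsed = (System.nanoTime() - start) / 1e9
    (result, elapsed)
  }
}

// Usage on a plain collection (no Spark required for this sketch).
// In spark-shell, one could time e.g. spark.range(n).cache().count()
// the same way, after a warm-up run to exclude JIT and setup costs.
val (sum, seconds) = BenchTimer.time((0 until 1000).sum)
println(f"sum=$sum elapsed=$seconds%.3f s")
```

Note that a single run of `count()` on a freshly built cache measures the cost of materializing the cache, which is what the numbers in this thread compare; timing a second `count()` would instead measure reads from the already-built cache.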