Hi, how does the performance difference change when compression is turned off? It is enabled by default.
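For example, in the spark-shell (a minimal sketch; it assumes the config is set before the Dataset is cached, and uses the spark.sql.inMemoryColumnarStorage.compressed setting, which defaults to true):

    // Turn off compression of the in-memory columnar cache
    // (spark.sql.inMemoryColumnarStorage.compressed defaults to true).
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")

    // Re-run the Dataset benchmark with compression disabled.
    spark.range(Int.MaxValue).cache().count()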
// maropu

Sent from my iPhone

On 2016/08/28 at 10:13, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

> Hi,
> I think this is a performance issue in both the DataFrame and the Dataset
> cache; it is not due only to Encoders. The DataFrame version
> "spark.range(Int.MaxValue).toDF.cache().count()" is also slow.
>
> While a cache for a DataFrame or Dataset is stored in a columnar format
> with some compressed data representations, we have found that there is
> room to improve performance. We have already created pull requests to
> address this, and they are under review:
> https://github.com/apache/spark/pull/11956
> https://github.com/apache/spark/pull/14091
>
> We would appreciate your feedback on these pull requests.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
> From: Maciej Bryński <mac...@brynski.pl>
> To: Spark dev list <dev@spark.apache.org>
> Date: 2016/08/28 05:40
> Subject: Cache'ing performance
>
> Hi,
> I did some benchmarking of the cache function today.
>
> RDD:
> sc.parallelize(0 until Int.MaxValue).cache().count()
>
> Datasets:
> spark.range(Int.MaxValue).cache().count()
>
> For me, Datasets were about 2 times slower.
>
> Results (3 nodes, 20 cores and 48 GB RAM each):
> RDD - 6 s
> Datasets - 13.5 s
>
> Is that the expected behavior for Datasets and Encoders?
>
> Regards,
> --
> Maciek Bryński
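For anyone wanting to reproduce the two numbers quoted above, a minimal timing harness for the spark-shell could look like this (a sketch; the time helper is hypothetical, and sc / spark are the shell's built-in SparkContext and SparkSession):

    // Hypothetical wall-clock timing helper for the two benchmarks quoted above.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // The first count() materializes the cache, so each call measures
    // the cost of building the cached representation.
    time("RDD cache")     { sc.parallelize(0 until Int.MaxValue).cache().count() }
    time("Dataset cache") { spark.range(Int.MaxValue).cache().count() }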