Hi,

How does the performance difference change if you turn off compression?
It is enabled by default for the in-memory columnar cache.
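
For example (a sketch; this assumes the spark.sql.inMemoryColumnarStorage.compressed option, which defaults to true):

// Turn off compression for the in-memory columnar cache before caching
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
spark.range(Int.MaxValue).cache().count()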

// maropu

Sent from my iPhone

On 2016/08/28, at 10:13, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

> Hi,
> I think that this is a performance issue in both the DataFrame and Dataset 
> caches; it is not due only to Encoders. The DataFrame version 
> "spark.range(Int.MaxValue).toDF.cache().count()" is also slow.
> 
> A cache for a DataFrame or Dataset is stored in a columnar format with a 
> compressed data representation. We have found that there is room to 
> improve its performance, and we have already created pull requests to 
> address it. These pull requests are under review: 
> https://github.com/apache/spark/pull/11956
> https://github.com/apache/spark/pull/14091
> 
> We would appreciate your feedback on these pull requests.
> 
> Best Regards,
> Kazuaki Ishizaki
> 
> 
> 
> From: Maciej Bryński <mac...@brynski.pl>
> To: Spark dev list <dev@spark.apache.org>
> Date: 2016/08/28 05:40
> Subject: Cache'ing performance
> 
> 
> 
> Hi,
> I did some benchmarking of the cache function today.
> 
> RDD
> sc.parallelize(0 until Int.MaxValue).cache().count()
> 
> Datasets
> spark.range(Int.MaxValue).cache().count()
> 
> For me, the Dataset version was about 2 times slower.
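> 
> (For anyone reproducing this, a simple wall-clock wrapper along the lines 
> below is enough. This is a sketch, and the time helper is just an 
> illustration, not the exact harness used:)
> 
> // Hypothetical helper: measures wall-clock time of a block and prints it
> def time[T](label: String)(body: => T): T = {
>   val start = System.nanoTime()
>   val result = body
>   println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
>   result
> }
> 
> time("RDD")(sc.parallelize(0 until Int.MaxValue).cache().count())
> time("Datasets")(spark.range(Int.MaxValue).cache().count())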
> 
> Results (3 nodes, 20 cores and 48 GB RAM each):
> RDD - 6 s
> Datasets - 13.5 s
> 
> Is that the expected behavior for Datasets and Encoders?
> 
> Regards,
> -- 
> Maciek Bryński
> 
