DataFrame degraded performance after DataFrame.cache

Justin Yip Tue, 07 Apr 2015 18:32:25 -0700

Hello,

I have a Parquet file of around 55M rows (~1 GB on disk). Performing a simple
grouping operation on it is pretty efficient (I get results within 10 seconds).
However, after calling DataFrame.cache, I observe a significant performance
degradation: the same operation now takes 3+ minutes.
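
A minimal sketch of how the two timings described above could be compared,
written against the PySpark DataFrame API. The helper names `timed` and
`compare_cache` and the grouping column are hypothetical and are not taken
from the actual job. Because `cache()` is lazy, the sketch times the first
action after it separately, so the cost of building the in-memory copy is not
mixed into the timing of the cached query.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_cache(df, group_col):
    """Time the same grouped count on `df` before and after df.cache().

    Assumption: `df` behaves like a Spark DataFrame, i.e. it provides
    groupBy(...).count().collect(), count() and cache().
    """
    def query():
        # The "simple grouping operation": count rows per group.
        return df.groupBy(group_col).count().collect()

    # 1. Query straight from the underlying source (e.g. the Parquet file).
    _, uncached_s = timed(query)

    # 2. cache() only marks the DataFrame for caching; nothing is stored yet.
    df.cache()

    # 3. The first action after cache() pays for building the cached copy.
    _, cache_build_s = timed(df.count)

    # 4. The same query again, now served from the cached data.
    _, cached_s = timed(query)

    return {
        "uncached_s": uncached_s,
        "cache_build_s": cache_build_s,
        "cached_s": cached_s,
    }
```

With PySpark this would be called as `compare_cache(df, "some_column")`, where
`df` was loaded from the Parquet file and `some_column` stands in for the real
grouping key.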


My hunch is that the DataFrame cannot leverage its columnar format once it is
persisted in memory, but I cannot find anything in the docs that mentions this.

Did I miss anything?

Thanks!

Justin
