Hello, I have a Parquet file of around 55M rows (~1 GB on disk). A simple grouping operation on it is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation: the same operation now takes 3+ minutes.
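For reference, the comparison can be sketched roughly as below. This is a minimal PySpark sketch, not the exact job: the helper names `timed` and `compare_cached`, the file path, and the grouping column are all hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_cached(spark, path, key):
    """Time the same grouping before and after DataFrame.cache.

    `spark` is an existing SparkSession; `path` and `key` are placeholders
    for the Parquet file and the grouping column (assumptions, not taken
    from the original job).
    """
    df = spark.read.parquet(path)

    # Grouping straight off the Parquet file.
    _, uncached_secs = timed(lambda: df.groupBy(key).count().collect())

    # cache() is lazy: it only marks the DataFrame for caching.
    df.cache()
    # Run an action so the cache is populated before timing.
    df.count()

    # Same grouping, now reading from the in-memory cache.
    _, cached_secs = timed(lambda: df.groupBy(key).count().collect())
    return uncached_secs, cached_secs
```

It would be called as something like `compare_cached(spark, "/path/to/data.parquet", "some_column")`. Because cache is lazy, the sketch forces materialization with a count first, so the second timing covers only the query over the cached data and not the cost of building the cache.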
My hunch is that a DataFrame cannot leverage its columnar format once it has been persisted in memory, but I cannot find this mentioned anywhere in the docs. Did I miss anything? Thanks! Justin