I have a parquet file which I am reading at least 4-5 times within my
application. I was wondering what the most efficient thing to do is.

Option 1. While writing the parquet file, immediately read it back into a
dataset and call cache. I am assuming that by doing an immediate read I might
benefit from some existing HDFS/Spark cache left over from the write process.
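
Roughly what I mean by Option 1, as a minimal sketch (df, spark, and the path
"/path/to/output" are just placeholders):

// Option 1 sketch: write the parquet file, then immediately read it back and cache it
df.write.mode("overwrite").parquet("/path/to/output")
val cached = spark.read.parquet("/path/to/output").cache()
// all later usages in the application would reuse `cached`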

Option 2. In my application, call cache the first time I actually need the dataset.
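
A sketch of Option 2 (same placeholder path as above):

// Option 2 sketch: defer the read, and cache only when the dataset is first needed
lazy val events = spark.read.parquet("/path/to/output").cache()
// the read and cache happen on the first access of `events`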

Option 3. After the parquet file has finished writing, create a temp view over
it and use the view for all subsequent access.
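
A sketch of Option 3 (the view name "events" and column "some_col" are just
examples):

// Option 3 sketch: register a temp view right after the write completes
spark.read.parquet("/path/to/output").createOrReplaceTempView("events")
// later usages go through the view, e.g.
val summary = spark.sql("SELECT some_col, count(*) FROM events GROUP BY some_col")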

I am also not very clear about the efficiency of reading from a temp view vs.
reading the parquet dataset directly.
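
The two access patterns I am comparing would look like this (placeholders again):

val fromView    = spark.table("events")                   // via the temp view
val fromParquet = spark.read.parquet("/path/to/output")   // direct parquet read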

FYI, the datasets I am referring to cannot all fit in memory; they are very
large.

Regards,
Rohit