I have a Parquet file which I read at least 4-5 times within my application, and I was wondering what the most efficient approach is.
Option 1. While writing the Parquet file, immediately read it back into a Dataset and call cache() on it. I am assuming that by doing an immediate read I might benefit from some existing HDFS/Spark caching left over from the write process.

Option 2. Call cache() the first time my application actually needs the Dataset.

Option 3. After the Parquet file is written, create a temp view from it and use that view in all subsequent usage. I am also not clear on the relative efficiency of reading from a temp view vs. a Parquet-backed Dataset.

FYI, the datasets I am referring to are very large; it is not possible to fit all of them in memory.

Rough sketches of what I mean by each option are in the P.S. below.

Regards,
Rohit
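
P.S. To make the three options concrete, here is roughly what I am comparing. This is only a sketch in Scala, not code from my application; the paths, the "someResult" Dataset, and the "output_view" name are placeholders I made up for illustration.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-reuse-sketch").getOrCreate()

// Placeholder for whatever Dataset my job actually produces.
val someResult = spark.read.parquet("/tmp/input.parquet")
someResult.write.mode("overwrite").parquet("/tmp/output.parquet")

// Option 1: right after writing, read the file back and cache it up front.
val ds1 = spark.read.parquet("/tmp/output.parquet")
ds1.cache()     // lazy; nothing is stored until the first action runs
ds1.count()     // force materialization into the cache immediately

// Option 2: do nothing at write time; cache only at the first point of use.
val ds2 = spark.read.parquet("/tmp/output.parquet")
ds2.cache()     // same call, just deferred until the Dataset is first needed

// Option 3: register a temp view over the written file and reuse it via SQL.
spark.read.parquet("/tmp/output.parquet").createOrReplaceTempView("output_view")
val fromView = spark.sql("SELECT * FROM output_view")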