Thanks Akhil.
I searched for DISK_AND_MEMORY_SER to figure out how it works, but I
cannot find any documentation on it. Do you have a link?
If DISK_AND_MEMORY_SER reads from and writes to disk with some memory
caching, does that mean the output will be written to disk
You can use spark-sql to solve this use case, and you don't need to have
800G of memory (though of course you would need it if you were caching
the whole data set in memory). You can persist the data by setting the
DISK_AND_MEMORY_SER property if you don't want to bring the whole data
set into memory, in this