Hi,

We have many Spark jobs that create lots of small files. To improve read
performance for our analysts, I'm testing what the optimal Parquet file
size should be.
I've found that the optimal file size should be around 1GB, and not less
than 128MB, depending on the size of the data.

I took one process to examine. It uses shuffle partitions = 600, which
produces files of about 11MB each. I added a repartition step to create
fewer files - roughly 12 files of ~600MB each. After testing it (select *
from table where ...) I saw that the old version (with more, smaller files)
ran faster than the new one. I then increased the number of files to 40 -
about 130MB each - and it still runs slower.
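
In case it helps, here is a minimal sketch of the repartition-before-write
step I'm describing, roughly as it would look in a spark-shell session (the
paths and the 40-partition target are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

// Sketch only - placeholder paths, not the real pipeline
val spark = SparkSession.builder()
  .appName("repartition-test")
  .config("spark.sql.shuffle.partitions", "600")  // current setting, gives ~11MB files
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/source_table/")

// Write fewer, larger files: 40 partitions -> ~130MB per file in my test
df.repartition(40)
  .write
  .mode("overwrite")
  .parquet("s3://bucket/target_table/")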

I would appreciate hearing about your experience with file sizes, and how to
optimize the number and size of files.

Thanks,
Tzahi
