Hello Yann,
From my understanding, when reading small files Spark will group them and load
the content of each batch into the same partition, so you won’t end up with one
partition per file and a huge number of very small partitions. This behavior is
controlled by the spark.sql.files.maxPartitionBytes setting.
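For example, something like this (a minimal sketch; the path and the 64MB
target are just illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("small-files-read")
      // Target roughly 64MB of input per partition instead of the 128MB default.
      .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
      .getOrCreate()

    // Many small files get packed into far fewer partitions on read.
    val df = spark.read.parquet("/path/to/many/small/files") // hypothetical path
    println(df.rdd.getNumPartitions)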
Yes, it does bin-packing for small files, which is a good thing: you avoid
having many small partitions, especially if you’re writing this data back out
(i.e. it’s effectively compacting as you read). The default maximum partition
size is 128MB, with a 4MB “cost” added for opening each file. You can configure
these through the spark.sql.files.maxPartitionBytes and
spark.sql.files.openCostInBytes settings.
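For intuition, here’s a rough sketch of the split-size calculation (this
paraphrases the logic Spark uses when planning file scans; the file count,
file size, and core count below are made-up numbers):

    // Defaults for the two settings above.
    val maxPartitionBytes = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
    val openCostInBytes   = 4L * 1024 * 1024   // spark.sql.files.openCostInBytes

    val numFiles      = 1000L             // hypothetical: 1000 small files...
    val fileSizeBytes = 1L * 1024 * 1024  // ...of 1MB each
    val cores         = 8L                // hypothetical default parallelism

    // Each file is charged its size plus the open cost, spread across cores.
    val totalBytes    = numFiles * (fileSizeBytes + openCostInBytes)
    val bytesPerCore  = totalBytes / cores
    val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
    // maxSplitBytes is 128MB here, so about 25 of these 1MB files
    // (25 * (1MB + 4MB) = 125MB) get packed into each partition.

So rather than 1,000 tiny partitions you end up with roughly 40.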