RE: [Spark SQL] Does Spark group small files

2018-11-14 Thread Lienhart, Pierre (DI IZ) - AF (ext)
Hello Yann,

From my understanding, when reading small files Spark will group them and load the content of each batch into the same partition, so you won’t end up with one partition per file, which would result in a huge number of very small partitions. This behavior is controlled by the spark.files.maxParti
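The grouping Pierre describes can be illustrated with a small simulation. This is a simplified sketch, not Spark's exact implementation: the constants mirror the documented defaults of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), and the greedy packing loop is an assumption that approximates how Spark bins files into read partitions.

```python
# Simplified model of Spark SQL packing small files into read partitions.
# Constants mirror the documented defaults; the packing loop is a sketch,
# not Spark's actual FilePartition logic.

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
OPEN_COST_IN_BYTES = 4 * 1024 * 1024     # spark.sql.files.openCostInBytes


def pack_files(file_sizes, max_partition_bytes=MAX_PARTITION_BYTES,
               open_cost=OPEN_COST_IN_BYTES):
    """Greedily bin-pack files into partitions, charging `open_cost` per
    file so thousands of tiny files don't all land in one partition."""
    partitions, current, current_size = [], [], 0
    # Largest files first, roughly like Spark's descending sort by size.
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost
        if current and current_size + cost > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += cost
    if current:
        partitions.append(current)
    return partitions


# 1000 files of 1 MB each: instead of 1000 one-file partitions, the files
# are grouped into ~128 MB bins (25 files of 5 MB effective cost each).
parts = pack_files([1024 * 1024] * 1000)
print(len(parts))  # -> 40
```

With the 4 MB open cost, each 1 MB file "weighs" 5 MB, so about 25 files fit per 128 MB partition — which is why the open cost matters: without it, a single partition could absorb an unbounded number of tiny files.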

Re: [Spark SQL] Does Spark group small files

2018-11-13 Thread Silvio Fiorito
Yes, it does bin-packing for small files, which is a good thing: you avoid having many small partitions, especially if you’re writing this data back out (i.e. it compacts as you read). The default partition size is 128 MB, with a 4 MB “cost” for opening each file. You can configure this using the s
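The message is cut off, but the two numbers Silvio quotes correspond to Spark SQL settings that do exist with exactly these defaults; shown here as a spark-defaults.conf fragment for reference:

```
# Target maximum bytes packed into a single read partition (default 128 MB)
spark.sql.files.maxPartitionBytes  134217728

# Estimated cost, in bytes, of opening a file (default 4 MB)
spark.sql.files.openCostInBytes    4194304
```

Lowering maxPartitionBytes yields more, smaller partitions; raising openCostInBytes makes Spark pack fewer files per partition.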