One general technique is to perform a second pass over the files later, for
example the next day or once a week, to concatenate the smaller files into
larger ones. This can be done for all file types and lets you make recent
data available to analysis tools while avoiding a large build-up of small
files.
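A minimal sketch of such a compaction pass in PySpark is below. The paths,
the target file count, and the per-day layout are assumptions for
illustration, not part of your setup; swap in the compacted directory only
after verifying the rewrite.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

    # Hypothetical paths: the landing area that accumulates many small
    # files, and a separate location for the compacted output.
    source_path = "/data/events/ingest/date=2024-01-01"
    target_path = "/data/events/compacted/date=2024-01-01"

    # Read all of the small Parquet files for the period being compacted.
    df = spark.read.parquet(source_path)

    # Rewrite the same data as a small, fixed number of larger files.
    # coalesce() avoids a full shuffle; choose the count so each output
    # file lands near your preferred size (e.g. 128-256 MB).
    df.coalesce(4).write.mode("overwrite").parquet(target_path)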
Hi Team,
We have scheduled jobs that read new records from a MySQL database every
hour and write (append) them to Parquet. For each append operation, Spark
creates 10 new partitions in the Parquet output.
Some of these partitions are fairly small in size (20-40 KB), leading to a
high number of small partitions.
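For context, a stripped-down version of the kind of hourly job described
above might look like the following; the JDBC URL, table name, credentials,
and output path are placeholders, and the incremental "new records" filter
is omitted.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hourly-mysql-ingest").getOrCreate()

    # Placeholder connection details for the MySQL source.
    new_rows = (spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/appdb")
        .option("dbtable", "events")
        .option("user", "reader")
        .option("password", "secret")
        .load())

    # Each append writes one Parquet file per partition of the DataFrame,
    # so an hourly batch of only a few MB spread over 10 partitions
    # produces 10 small files per run.
    new_rows.write.mode("append").parquet("/data/events/ingest")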