One general technique is to perform a second pass over the files later, for example the next day or once a week, to concatenate the smaller files into larger ones. This works for all file types and lets you make recent data available to analysis tools while avoiding a large build-up of small files overall.
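A minimal sketch of such a compaction pass, written against the Spark DataFrame API (the paths, target file size, and object name are made up for illustration; on Spark 1.x you would use SQLContext rather than SparkSession, and the same idea applies):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-compaction").getOrCreate()

    // Hypothetical layout: one directory per day, filled with small hourly appends.
    val inputDir  = "hdfs:///data/events/day=2015-09-13"
    val outputDir = "hdfs:///data/events_compacted/day=2015-09-13"
    val targetFileBytes = 128L * 1024 * 1024   // aim for roughly 128 MB per output file

    // Measure the total size of the input so we can pick a sensible file count.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
    val numFiles = math.max(1, (totalBytes / targetFileBytes).toInt)

    // Read all the small files and rewrite them as numFiles larger files.
    // coalesce() avoids a full shuffle; use repartition() if the data is skewed.
    spark.read.parquet(inputDir)
      .coalesce(numFiles)
      .write.mode("overwrite")
      .parquet(outputDir)

    // After verifying the output, swap the compacted directory in for the
    // original (or point downstream readers at the compacted location).
    spark.stop()
  }
}

Writing to a separate directory and swapping afterwards keeps readers off half-written data; the exact swap mechanism depends on how your queries resolve the path.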
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, Sep 13, 2015 at 12:54 PM, sonal sharma <sonalsharma1...@gmail.com> wrote:

> Hi Team,
>
> We have scheduled jobs that read new records from the MySQL database every
> hour and write (append) them to Parquet. For each append operation, Spark
> creates 10 new partitions in the Parquet file.
>
> Some of these partitions are fairly small (20-40 KB), leading to a high
> number of small partitions and affecting the overall read performance.
>
> Is there any way in which we can configure Spark to merge smaller
> partitions into a bigger one to avoid too many partitions? Or can we
> define a configuration in Parquet to set a minimum partition size,
> say 64 MB?
>
> Coalesce/repartition will not work for us as we have highly variable
> activity on the database during peak and non-peak hours.
>
> Regards,
> Sonal