Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass later over the files, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and lets you make recent data available to analysis tools while avoiding a large buildup of small files.
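The compaction pass described above can be sketched with plain files (a minimal stdlib-Python illustration; the function name and threshold are made up for the example). Note that Parquet files cannot be concatenated byte-for-byte, since each file carries its own footer metadata; for Parquet you would re-read and rewrite the data with a Parquet-aware tool such as Spark, but the second-pass idea is the same:

```python
import os
import tempfile

def compact_small_files(src_dir, dst_path, max_small_size=1024):
    """Concatenate every file in src_dir smaller than max_small_size
    bytes into one larger file at dst_path, then delete the originals.
    Byte-level concatenation works for line-oriented formats (CSV,
    logs); Parquet must instead be read and rewritten by a
    Parquet-aware tool."""
    paths = [os.path.join(src_dir, f) for f in sorted(os.listdir(src_dir))]
    small = [p for p in paths if os.path.getsize(p) < max_small_size]
    with open(dst_path, "wb") as out:
        for p in small:
            with open(p, "rb") as f:
                out.write(f.read())
            os.remove(p)
    return len(small)

# Demo: five tiny "hourly" files compacted into one larger file.
d = tempfile.mkdtemp()
for i in range(5):
    with open(os.path.join(d, f"part-{i:05d}.csv"), "w") as f:
        f.write(f"row-{i}\n")
merged = os.path.join(d, "compacted.csv")
n = compact_small_files(d, merged)
print(n)  # 5 small files merged
```

Running the pass off-peak (nightly or weekly, as suggested) keeps the hourly ingest path untouched while the file count stays bounded.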

Parquet partitioning performance issue

2015-09-13 Thread sonal sharma
Hi Team, We have scheduled jobs that read new records from a MySQL database every hour and write (append) them to Parquet. For each append operation, Spark creates 10 new partitions in the Parquet output. Some of these partitions are fairly small (20-40 KB), leading to a high number of smaller partitions.
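The buildup described can be reproduced in miniature with plain files (a stdlib-Python sketch; the helper and the 30-row batch size are hypothetical). The 10 files per append correspond to Spark writing one part file per task, so every hourly append adds 10 files regardless of how little data arrived:

```python
import os
import tempfile

def append_batch(out_dir, batch_id, rows, num_tasks=10):
    """Write one hourly batch the way a 10-task job would: rows are
    split across num_tasks writers, each emitting its own part file,
    so every append adds num_tasks files however small they are."""
    for task in range(num_tasks):
        chunk = rows[task::num_tasks]  # round-robin split across tasks
        path = os.path.join(out_dir, f"part-{batch_id:03d}-{task:02d}")
        with open(path, "w") as f:
            f.writelines(r + "\n" for r in chunk)

d = tempfile.mkdtemp()
for hour in range(24):  # one day of hourly appends
    append_batch(d, hour, [f"row-{hour}-{i}" for i in range(30)])
print(len(os.listdir(d)))  # 24 batches x 10 tasks = 240 files
```

With only 30 rows per batch, each part file holds just 3 rows, which mirrors the 20-40 KB Parquet partitions: the file count grows with wall-clock time, not with data volume.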