One general technique is to perform a second pass over the files later, for example the next day or once a week, to concatenate the smaller files into larger ones. This works for all file types and lets you make recent data available to analysis tools while avoiding a large build-up of small files overall.
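A minimal sketch of such a compaction pass, written against the Spark DataFrame API (the paths, target file size, and object name are made up for illustration; on Spark 1.x you would use SQLContext rather than SparkSession, and the same idea applies):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-compaction").getOrCreate()

    // Hypothetical layout: one directory per day, filled with small hourly appends.
    val inputDir  = "hdfs:///data/events/day=2015-09-13"
    val outputDir = "hdfs:///data/events_compacted/day=2015-09-13"
    val targetFileBytes = 128L * 1024 * 1024   // aim for roughly 128 MB per output file

    // Measure the total size of the input so we can pick a sensible file count.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
    val numFiles = math.max(1, (totalBytes / targetFileBytes).toInt)

    // Read all the small files and rewrite them as numFiles larger files.
    // coalesce() avoids a full shuffle; use repartition() if the data is skewed.
    spark.read.parquet(inputDir)
      .coalesce(numFiles)
      .write.mode("overwrite")
      .parquet(outputDir)

    // After verifying the output, swap the compacted directory in for the
    // original (or point downstream readers at the compacted location).
    spark.stop()
  }
}

Writing to a separate directory and swapping afterwards keeps readers off half-written data; the exact swap mechanism depends on how your queries resolve the path.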
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, Sep 13, 2015 at 12:54 PM, sonal sharma <sonalsharma1...@gmail.com> wrote:

> Hi Team,
>
> We have scheduled jobs that read new records from the MySQL database every
> hour and write (append) them to Parquet. For each append operation, Spark
> creates 10 new partitions in the Parquet file.
>
> Some of these partitions are fairly small (20-40 KB), leading to a high
> number of small partitions and affecting the overall read performance.
>
> Is there any way in which we can configure Spark to merge smaller
> partitions into a bigger one to avoid too many partitions? Or can we
> define a configuration in Parquet to set a minimum partition size,
> say 64 MB?
>
> Coalesce/repartition will not work for us as we have highly variable
> activity on the database during peak and non-peak hours.
>
> Regards,
> Sonal