Yeah, makes sense. Also, is there any significant performance improvement from using bucketBy compared to just sorting?
Regards,
Gourav

On Thu, Jul 4, 2019 at 1:34 PM Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> You need to first repartition (at a minimum by bucketColumn1) since each
> task will write out the buckets/files. If the bucket keys are distributed
> randomly across the RDD partitions, then you will get multiple files per
> bucket.
>
> *From:* Arwin Tio <arwin....@hotmail.com>
> *Date:* Thursday, July 4, 2019 at 3:22 AM
> *To:* "user@spark.apache.org" <user@spark.apache.org>
> *Subject:* Parquet 'bucketBy' creates a ton of files
>
> I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
>
> ```java
> dataframe.write()
>     .format("parquet")
>     .bucketBy(500, bucketColumn1, bucketColumn2)
>     .mode(SaveMode.Overwrite)
>     .option("path", "s3://my-bucket")
>     .saveAsTable("my_table");
> ```
>
> The problem is that my Spark cluster has about 500 partitions/tasks/executors
> (not sure of the terminology), so I end up with files that look like:
>
> ```
> part-00001-{UUID}_00001.c000.snappy.parquet
> part-00001-{UUID}_00002.c000.snappy.parquet
> ...
> part-00001-{UUID}_00500.c000.snappy.parquet
>
> part-00002-{UUID}_00001.c000.snappy.parquet
> part-00002-{UUID}_00002.c000.snappy.parquet
> ...
> part-00002-{UUID}_00500.c000.snappy.parquet
>
> part-00500-{UUID}_00001.c000.snappy.parquet
> part-00500-{UUID}_00002.c000.snappy.parquet
> ...
> part-00500-{UUID}_00500.c000.snappy.parquet
> ```
>
> That's 500 x 500 = 250,000 bucketed Parquet files! It takes forever for the
> `FileOutputCommitter` to commit that to S3.
>
> Is there a way to generate **one file per bucket**, like in Hive? Or is there
> a better way to deal with this problem? As of now it seems like I have to
> choose between lowering the parallelism of my cluster (reducing the number of
> writers) or reducing the parallelism of my Parquet files (reducing the number
> of buckets), which will lower the parallelism of my downstream jobs.
>
> Thanks
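
For reference, a minimal sketch of the repartition-before-write approach Silvio describes, reusing the names from Arwin's original snippet (`dataframe`, `bucketColumn1`, `bucketColumn2`, the 500-bucket count, and the `s3://my-bucket` path are taken from that example and not verified here):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Hash-repartition on the bucket columns first, so each task holds whole
// buckets instead of a random slice of every bucket.
Dataset<Row> repartitioned =
    dataframe.repartition(500, col(bucketColumn1), col(bucketColumn2));

repartitioned.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

With the repartition count matching the bucket count and the same columns used for both, each write task should only see rows for its own bucket(s), so the output is on the order of 500 files rather than 500 x 500. The exact file count still depends on how Spark's hash partitioning lines up with its bucketing, so treat this as a sketch to verify on your own data.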