Why do you need 10000 partitions when 10 partitions are doing the job?

Thanks,
Ankit

From: vincent gromakowski <vincent.gromakow...@gmail.com>
Date: Thursday, 25. April 2019 at 09:12
To: Juho Autio <juho.au...@rovio.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

Which metastore are you using?

On Thu, Apr 25, 2019 at 09:02, Juho Autio <juho.au...@rovio.com> wrote:

Would anyone be able to answer this question about the non-optimal implementation of insertInto?

On Thu, Apr 18, 2019 at 4:45 PM Juho Autio <juho.au...@rovio.com> wrote:

Hi,

My job is writing ~10 partitions with insertInto. With the same input / output data, the total duration of the job is very different depending on how many partitions the target table has.

Target table with 10 partitions: 1 min 30 s
Target table with ~10000 partitions: 13 min 0 s

It seems that Spark always fetches the full list of partitions in the target table. When this happens, the cluster is basically idling while the driver is listing partitions. Here's a thread dump for executor driver from such idle time:

https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20

Is there any way to optimize this currently? Is this a known issue? Any plans to improve?

My code is essentially:

    spark = SparkSession.builder \
        .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER') \
        .config("hive.exec.dynamic.partition", "true") \
        .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .enableHiveSupport() \
        .getOrCreate()

    out_df.write \
        .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
        .insertInto(target_table_name, overwrite=True)

The table was originally created from Spark with saveAsTable.

Does Spark need to know anything about the existing partitions though?
As a manual workaround I would write the files directly to the partition locations, deleting any existing files in a partition first, and then call the metastore with ALTER TABLE ... ADD IF NOT EXISTS PARTITION. This doesn't require prior knowledge of the existing partitions.

Thanks.
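
For anyone who wants to try the manual workaround described above, here is a rough sketch of what it could look like in PySpark. It is only an illustration under some assumptions: the helper names (partition_path, add_partition_sql, overwrite_partition) are made up for this example, the table is assumed to use Hive-style `col=value` partition directories under a single root, partition values are assumed to be plain strings that need no escaping, and the DataFrame passed in is assumed to contain rows for exactly one partition. Whether it actually avoids the partition listing depends on the metastore and Spark version, so treat it as a starting point rather than a verified fix.

```python
from typing import Dict


def partition_path(table_root: str, spec: Dict[str, str]) -> str:
    """Build a Hive-style partition directory, e.g. <root>/dt=2019-04-18."""
    parts = "/".join(f"{col}={val}" for col, val in spec.items())
    return f"{table_root.rstrip('/')}/{parts}"


def add_partition_sql(table: str, spec: Dict[str, str], location: str) -> str:
    """Build DDL that registers one partition; a no-op if it already exists."""
    cols = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({cols}) LOCATION '{location}'"
    )


def overwrite_partition(spark, df, table, table_root, spec, fmt="parquet"):
    """Replace the files of a single partition, then register it.

    `spark` is a SparkSession and `df` holds rows for this partition only
    (both are assumptions of this sketch; the caller does the filtering).
    """
    path = partition_path(table_root, spec)
    # The partition values live in the directory name, so drop those columns.
    # mode("overwrite") on the partition directory replaces any existing files.
    df.drop(*spec.keys()).write.mode("overwrite").format(fmt).save(path)
    # Only this one partition is sent to the metastore.
    spark.sql(add_partition_sql(table, spec, path))
```

With a hypothetical table root of `s3://bucket/tbl`, `add_partition_sql("db.tbl", {"dt": "2019-04-18"}, partition_path("s3://bucket/tbl", {"dt": "2019-04-18"}))` produces a single ALTER TABLE statement for that partition, which `overwrite_partition` would run after writing the files.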