Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

Khare, Ankit Thu, 25 Apr 2019 00:59:10 -0700

Why do you need 10000 partition when 10 partition is doing the job .. ??

 Thanks
Ankit

From: vincent gromakowski <vincent.gromakow...@gmail.com>
Date: Thursday, 25. April 2019 at 09:12
To: Juho Autio <juho.au...@rovio.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many 
partitions

Which metastore are you using?

Le jeu. 25 avr. 2019 à 09:02, Juho Autio 
<juho.au...@rovio.com<mailto:juho.au...@rovio.com>> a écrit :
Would anyone be able to answer this question about the non-optimal 
implementation of insertInto?

On Thu, Apr 18, 2019 at 4:45 PM Juho Autio 
<juho.au...@rovio.com<mailto:juho.au...@rovio.com>> wrote:
Hi,

My job is writing ~10 partitions with insertInto. With the same input / output 
data the total duration of the job is very different depending on how many 
partitions the target table has.

Target table with 10 of partitions:
1 min 30 s

Target table with ~10000 partitions:
13 min 0 s

It seems that spark is always fetching the full list of partitions in target 
table. When this happens, the cluster is basically idling while driver is 
listing partitions.

Here's a thread dump for executor driver from such idle time:
https://gist.github.com/juhoautio/bdbc8eb339f163178905322fc393da20

Is there any way to optimize this currently? Is this a known issue? Any plans 
to improve?

My code is essentially:

spark = SparkSession.builder \
    .config('spark.sql.hive.caseSensitiveInferenceMode', 'NEVER_INFER') \
    .config("hive.exec.dynamic.partition", "true") \
    .config('spark.sql.sources.partitionOverwriteMode', 'dynamic') \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .enableHiveSupport() \
    .getOrCreate()

out_df.write \
    .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
    .insertInto(target_table_name, overwrite=True)

Table has been originally created from spark with saveAsTable.

Does spark need to know anything about the existing partitions though? As a 
manual workaround I would write the files directly to the partition locations, 
delete existing files first if there's anything in that partition, and then 
call metastore to ALTER TABLE IF NOT EXISTS ADD PARTITION. This doesn't require 
previous knowledge on existing partitions.

Thanks.

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

Reply via email to