Hi,

We are using the DataFrame insertInto method to write into an
object-store-backed Hive table on Google Cloud, and we have observed that
this approach is slow.

From what we have read, writes to Hive tables in Spark happen in two
phases:

   - Step 1 – DistributedWrite: Data is written to a Hive staging directory
   using an OutputCommitter. This step happens in a distributed manner
   across multiple executors.
   - Step 2 – FinalCopy: After the new files are written to the Hive
   staging directory, they are moved to the final location using the
   FileSystem *rename* API. This step unfortunately happens serially in
   the driver. As part of this, the metastore is also updated with the new
   partition information. (On object stores such as GCS, rename is not a
   cheap metadata operation but a copy followed by a delete, which makes
   this step particularly expensive.)
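
For context, here is roughly the write we are doing today (a minimal
sketch; the session setup, the table name our_db.events and the source
path are placeholders, not our real ones):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Placeholder session; the real job configures GCS credentials and the
// Hive metastore similarly.
val spark = SparkSession.builder()
  .appName("hive-insert")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source path; in reality this is our incoming batch.
val df = spark.read.parquet("gs://our-bucket/incoming/")

// INSERT OVERWRITE into the Hive table. This is what triggers the
// two-phase write: a distributed write into the .hive-staging directory,
// followed by serial renames into the final location on the driver.
df.write.mode(SaveMode.Overwrite).insertInto("our_db.events")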

We are considering writing the data directly to the final table path and
then programmatically adding the partitions (or running MSCK REPAIR TABLE)
to save the time spent in the rename operation; a rough sketch of this
idea follows below. Are there any other elegant ways to implement this so
that the FinalCopy step (the rename API operation) can be eliminated?
Any suggestions to speed up this write would be appreciated.
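
Below is a minimal sketch of what we have in mind, assuming a table
partitioned by a single dt column; the bucket, warehouse path and
partition value are hypothetical:

// A direct-path write that bypasses the Hive staging directory. NOTE:
// with the default partitionOverwriteMode ("static"), SaveMode.Overwrite
// would wipe the entire path; "dynamic" limits the overwrite to the
// partitions present in the incoming data.
val targetPath = "gs://our-bucket/warehouse/our_db.db/events"  // hypothetical

df.write
  .mode(SaveMode.Overwrite)
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("dt")
  .parquet(targetPath)

// Register the partitions with the metastore. Either discover them all
// (scans the whole table location, which can itself be slow on GCS):
spark.sql("MSCK REPAIR TABLE our_db.events")

// ...or add each partition we just wrote explicitly, avoiding the scan:
spark.sql(
  s"""ALTER TABLE our_db.events ADD IF NOT EXISTS
     |PARTITION (dt='2024-01-01')
     |LOCATION '$targetPath/dt=2024-01-01'""".stripMargin)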

A few things to consider:
1. We receive old data as well as new data, so there will be new
partitions as well as upserts to existing partitions.
2. Insert overwrite can happen into both static and dynamic partitions;
the settings we believe are relevant are sketched after this list.
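
For point 2, in case we stay with insertInto, these are the settings we
understand to be relevant (our assumption: spark.sql.sources.partitionOverwriteMode
applies to data source tables, while the hive.exec.* settings govern
dynamic-partition INSERT OVERWRITE on Hive-serde tables; please correct
us if this is wrong):

// Replace only the partitions present in the incoming data instead of
// truncating the whole table (assumed: data source tables, Spark 2.3+).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Allow dynamic-partition INSERT OVERWRITE on the Hive side.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Same insertInto as before; partitions are resolved dynamically from
// the partition columns in df.
df.write.mode(SaveMode.Overwrite).insertInto("our_db.events")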

Looking forward to a solution.

Regards
Joyan
