Hi Mich,
Ok, my use-case is a bit different.
I have a Hive table partitioned by date and need to do dynamic partition
updates (insert overwrite) daily for the last 30 days of partitions.
The ETL inside the staging directories completes in barely 5 minutes, but
then the renaming takes a lot of time.
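For context, the daily job is essentially a dynamic-partition overwrite. A minimal sketch of that pattern, assuming the table name `events` and partition column `event_date` are placeholders for your actual schema (the `partitionOverwriteMode=dynamic` setting is what limits the overwrite to only the partitions present in the incoming data):

```sql
-- Overwrite only the partitions that appear in the source data,
-- rather than truncating the whole table.
SET spark.sql.sources.partitionOverwriteMode=dynamic;

-- Hypothetical table/column names for illustration.
INSERT OVERWRITE TABLE events PARTITION (event_date)
SELECT col_a, col_b, event_date
FROM   staging_events
WHERE  event_date >= date_sub(current_date(), 30);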
Spark has no role in creating that Hive staging directory. That directory
belongs to Hive; Spark simply does the ETL there, loading into the Hive
managed table in your case, which ends up in the staging directory.
I suggest that you review your design and use an external Hive table with
an explicit location on
It does help performance but not significantly.
I am just wondering: once Spark creates that staging directory along with
the _SUCCESS file, can we just run a gsutil rsync command to move those
files to the original directory? Has anyone tried this approach, or does
anyone foresee any concerns?
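For the avoidance of doubt, the idea would be something along these lines — a minimal sketch only, where the bucket and paths are hypothetical placeholders; `gsutil -m` parallelizes the copy, and `rsync -r` recurses into the staging directory (note that on GCS this copies objects rather than performing a cheap rename, so it avoids the slow rename step but still pays the copy cost):

```shell
# Hypothetical paths for illustration only.
STAGING=gs://my-bucket/warehouse/events/.hive-staging_hive_xxx/-ext-10000
TARGET=gs://my-bucket/warehouse/events

# Parallel, recursive sync from the staging directory into the
# final table location once the _SUCCESS marker exists.
gsutil -m rsync -r "${STAGING}" "${TARGET}"
```

One concern worth flagging: doing this outside Hive means the metastore is not updated for any new partitions, so an `MSCK REPAIR TABLE` (or explicit `ALTER TABLE ... ADD PARTITION`) step would still be needed afterwards.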
On Mon, 17 Jul 2023 at