I have a partitioned Hive external table, described below:

scala> spark.sql("describe extended db1.table1").show(100,false)
+----------------------------+---------------------------------------------------------------+
|col_name                    |data_type                                                      |
+----------------------------+---------------------------------------------------------------+
|name                        |string                                                         |
|event_partition             |string                                                         |
|# Partition Information     |                                                               |
|# col_name                  |data_type                                                      |
|event_partition             |string                                                         |
|                            |                                                               |
|# Detailed Table Information|                                                               |
|Catalog                     |spark_catalog                                                  |
|Database                    |db1                                                            |
|Table                       |table1                                                         |
|Owner                       |root                                                           |
|Created Time                |Tue Apr 15 15:30:00 UTC 2025                                   |
|Last Access                 |UNKNOWN                                                        |
|Created By                  |Spark 3.5.3                                                    |
|Type                        |EXTERNAL                                                       |
|Provider                    |hive                                                           |
|Table Properties            |[transient_lastDdlTime=1746110529]                             |
|Location                    |gs://my-bucket/db1/table1                                      |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
|Storage Properties          |[serialization.format=1]                                       |
|Partition Provider          |Catalog                                                        |
+----------------------------+---------------------------------------------------------------+
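
For reference, DDL roughly equivalent to this table would look like the sketch below (reconstructed from the describe output above; it is not the exact statement, and it assumes the db1 database already exists and Hive support is enabled):

// Sketch only: rough equivalent of the table definition shown above,
// not the exact DDL that created it.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS db1.table1 (
    name STRING
  )
  PARTITIONED BY (event_partition STRING)
  STORED AS PARQUET
  LOCATION 'gs://my-bucket/db1/table1'
""")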


Below is my existing partition, created via Spark SQL:

scala> val catalogPartitions = spark.sharedState.externalCatalog.listPartitions("db1", "table1")
scala> val partitionValues = catalogPartitions.foreach(cp => {
     |   val partitionSpec = cp.spec
     |   println(partitionSpec + " Parameters:" + cp.parameters + " lastAccessTime:" + cp.lastAccessTime + " createTime:" + cp.createTime)
     | })
Map(event_partition -> 2024-01-04) Parameters:Map(**transient_lastDdlTime -> 1744731019**, totalSize -> 475, numFiles -> 1) lastAccessTime:0 createTime:1744731019000
partitionValues: Unit = ()

When I insert a new partition, transient_lastDdlTime is updated in the Hive metastore:

scala> spark.sql("insert overwrite table db1.table1 partition(event_partition) select 'A','2024-01-05'").show()
++
||
++
++

scala> val catalogPartitions = spark.sharedState.externalCatalog.listPartitions("db1", "table1")
scala> val partitionValues = catalogPartitions.foreach(cp => {
     |   val partitionSpec = cp.spec
     |   println(partitionSpec + " Parameters:" + cp.parameters + " lastAccessTime:" + cp.lastAccessTime + " createTime:" + cp.createTime)
     | })
Map(event_partition -> 2024-01-04) Parameters:Map(transient_lastDdlTime -> 1744731019, totalSize -> 475, numFiles -> 1) lastAccessTime:0 createTime:1744731019000
Map(event_partition -> 2024-01-05) Parameters:Map(**transient_lastDdlTime -> 1746112922**, totalSize -> 455, numFiles -> 1) lastAccessTime:0 createTime:1746112922000
partitionValues: Unit = ()


When I insert overwrite the same partition, transient_lastDdlTime and the column stats are not updated. This used to work in Spark 2.4:

scala> spark.sql("insert overwrite table db1.table1 partition(event_partition) select 'B','2024-01-05'").show()
++
||
++
++

scala> val catalogPartitions = spark.sharedState.externalCatalog.listPartitions("db1", "table1")
scala> val partitionValues = catalogPartitions.foreach(cp => {
     |   val partitionSpec = cp.spec
     |   println(partitionSpec + " Parameters:" + cp.parameters + " lastAccessTime:" + cp.lastAccessTime + " createTime:" + cp.createTime)
     | })
Map(event_partition -> 2024-01-04) Parameters:Map(transient_lastDdlTime -> 1744731019, totalSize -> 475, numFiles -> 1) lastAccessTime:0 createTime:1744731019000
Map(event_partition -> 2024-01-05) Parameters:Map(**transient_lastDdlTime -> 1746112922**, totalSize -> 455, numFiles -> 1) lastAccessTime:0 createTime:1746112922000
partitionValues: Unit = ()
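
For anyone trying to reproduce this, the single partition can also be read straight from the external catalog; a minimal sketch using the same API:

// Minimal sketch: fetch the overwritten partition directly and print its
// transient_lastDdlTime, to rule out anything odd in listPartitions.
val p = spark.sharedState.externalCatalog
  .getPartition("db1", "table1", Map("event_partition" -> "2024-01-05"))
println(p.parameters.get("transient_lastDdlTime"))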


When I perform the same insert overwrite in Hive for the same partition, transient_lastDdlTime and the column stats are updated:

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table db1.table1 partition(event_partition)
      select 'B','2024-01-05' union all
      select 'C','2024-01-06';
Query ID = pradeep_20250501152449_33c17f33-8084-4c55-a49b-6d0fe99b17e5
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1746108119609_0009)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0
Map 4 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 18.37 s
----------------------------------------------------------------------------------------------
Loading data to table db1.table1 partition (event_partition=null)

Loaded : 2/2 partitions.
Time taken to load dynamic partitions: 1.763 seconds
Time taken for adding to write entity : 0.002 seconds
OK
Time taken: 90.182 seconds

scala> val catalogPartitions = spark.sharedState.externalCatalog.listPartitions("db1", "table1")
scala> val partitionValues = catalogPartitions.foreach(cp => {
     |   val partitionSpec = cp.spec
     |   println(partitionSpec + " Parameters:" + cp.parameters + " lastAccessTime:" + cp.lastAccessTime + " createTime:" + cp.createTime)
     | })
Map(event_partition -> 2024-01-04) Parameters:Map(transient_lastDdlTime -> 1744731019, totalSize -> 475, numFiles -> 1) lastAccessTime:0 createTime:1744731019000
Map(event_partition -> 2024-01-05) Parameters:Map(rawDataSize -> 1, numFiles -> 1, *transient_lastDdlTime -> 1746113178*, totalSize -> 316, **COLUMN_STATS_ACCURATE -> {"BASIC_STATS":"true","COLUMN_STATS":{"name":"true"}}, numRows -> 1**) lastAccessTime:0 createTime:1746112922000
Map(event_partition -> 2024-01-06) Parameters:Map(rawDataSize -> 1, numFiles -> 1, transient_lastDdlTime -> 1746113178, totalSize -> 316, COLUMN_STATS_ACCURATE -> {"BASIC_STATS":"true","COLUMN_STATS":{"name":"true"}}, numRows -> 1) lastAccessTime:0 createTime:0
partitionValues: Unit = ()

Below is what I implemented on Spark 2.4, and it used to work; after upgrading to Spark 3.x this functionality is broken: Get all the new partitions that are written to Hive metastore by Spark
<https://stackoverflow.com/questions/57202917/get-all-the-new-partitions-that-are-written-to-hive-metastore-by-spark>
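
The gist of that approach is to snapshot transient_lastDdlTime per partition before a write and diff it afterwards to find the partitions that were added or rewritten; a minimal sketch of the idea (names are illustrative, not my exact code):

// Illustrative sketch of the Spark 2.4-era approach: diff transient_lastDdlTime
// across partitions before and after a write to find what changed.
def ddlTimes(db: String, table: String): Map[Map[String, String], String] =
  spark.sharedState.externalCatalog.listPartitions(db, table)
    .map(p => p.spec -> p.parameters.getOrElse("transient_lastDdlTime", "0"))
    .toMap

val before = ddlTimes("db1", "table1")
spark.sql("insert overwrite table db1.table1 partition(event_partition) select 'B','2024-01-05'")
val after = ddlTimes("db1", "table1")

// Partitions that are new or whose transient_lastDdlTime changed.
val changed = after.filter { case (spec, t) => before.get(spec) != Some(t) }.keys
changed.foreach(println)

On Spark 3.x an insert overwrite of an existing partition is missed by this check, because its transient_lastDdlTime never changes.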

Regards,
Pradeep
