Hi there. I also posted this problem to the Spark list. I am not sure whether this is a Spark or a Hive metastore problem, or whether there is some metastore tuning configuration that works around it.
Spark can't see Hive schema updates, partly because it stores its own copy of the schema in the Hive metastore table properties.

1. FROM SPARK: create a table
=============================

    >>> spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
    >>> spark.table("my_table").printSchema()
    root
     |-- col1: integer (nullable = true)
     |-- col2: integer (nullable = true)

2. FROM HIVE: alter the schema
==============================

    0: jdbc:hive2://localhost:10000> ALTER TABLE my_table REPLACE COLUMNS(`col1` int, `col2` int, `col3` string);
    0: jdbc:hive2://localhost:10000> describe my_table;
    +-----------+------------+----------+
    | col_name  | data_type  | comment  |
    +-----------+------------+----------+
    | col1      | int        |          |
    | col2      | int        |          |
    | col3      | string     |          |
    +-----------+------------+----------+

3. FROM SPARK: problem, the new column does not appear
======================================================

    >>> spark.table("my_table").printSchema()
    root
     |-- col1: integer (nullable = true)
     |-- col2: integer (nullable = true)

4. FROM METASTORE DB: the columns are stored in two places
==========================================================

    metastore=# select * from "COLUMNS_V2";
     CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
    -------+---------+-------------+-----------+-------------
         2 |         | col1        | int       |           0
         2 |         | col2        | int       |           1
         2 |         | col3        | string    |           2

    metastore=# select * from "TABLE_PARAMS";
     TBL_ID | PARAM_KEY                         | PARAM_VALUE
    --------+-----------------------------------+-------------
          1 | spark.sql.sources.provider        | parquet
          1 | spark.sql.sources.schema.part.0   | {"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}
          1 | spark.sql.create.version          | 2.4.8
          1 | spark.sql.sources.schema.numParts | 1
          1 | last_modified_time                | 1641483180
          1 | transient_lastDdlTime             | 1641483180
          1 | last_modified_by                  | anonymous

    metastore=# truncate "TABLE_PARAMS";
    TRUNCATE TABLE
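As far as I can tell from the dump above, Spark rebuilds its schema from the spark.sql.sources.schema.* properties rather than from COLUMNS_V2, which would explain why the Hive-side ALTER is invisible. A minimal plain-Python sketch of that reassembly (the params dict is copied from the TABLE_PARAMS dump in step 4; the helper name is mine, not a Spark API):

```python
import json

# TABLE_PARAMS rows for the table, copied from the step-4 dump.
params = {
    "spark.sql.sources.provider": "parquet",
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": (
        '{"type":"struct","fields":['
        '{"name":"col1","type":"integer","nullable":true,"metadata":{}},'
        '{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}'
    ),
}

def schema_from_params(params):
    # Mimic (my understanding of) Spark's behavior: concatenate
    # schema.part.0 .. schema.part.numParts-1 and parse the JSON,
    # ignoring COLUMNS_V2 entirely.
    num_parts = int(params["spark.sql.sources.schema.numParts"])
    raw = "".join(
        params[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts)
    )
    return json.loads(raw)

schema = schema_from_params(params)
print([f["name"] for f in schema["fields"]])  # col3 is missing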
5. FROM SPARK: now the column magically appears
===============================================

    >>> spark.table("my_table").printSchema()
    root
     |-- col1: integer (nullable = true)
     |-- col2: integer (nullable = true)
     |-- col3: string (nullable = true)

So is it really necessary to store that stuff in TABLE_PARAMS?
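One thing I noticed: truncating all of TABLE_PARAMS works, but it also wipes properties unrelated to the schema (provider, last_modified_time, etc.). A more surgical variant, untested against a real metastore, would drop only Spark's cached schema keys. Sketched over a params dict like the step-4 dump (the helper name is mine):

```python
def drop_spark_schema_keys(params):
    # Keep everything except Spark's cached schema properties
    # (spark.sql.sources.schema.numParts and schema.part.N).
    return {k: v for k, v in params.items()
            if not k.startswith("spark.sql.sources.schema")}

params = {
    "spark.sql.sources.provider": "parquet",
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": '{"type":"struct","fields":[]}',
    "transient_lastDdlTime": "1641483180",
}
kept = drop_spark_schema_keys(params)
print(sorted(kept))
```

Against the metastore itself, the same idea would presumably be a DELETE on the PARAM_KEY column shown in step 4 (e.g. keys like 'spark.sql.sources.schema.%' for that TBL_ID) instead of a full truncate, but I have not verified whether Spark then falls back cleanly to the Hive schema.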