Hi there.

I also posted this problem to the Spark list. I am not sure whether this
is a Spark or a Hive metastore problem, or whether there is some
metastore tuning configuration that works around it.


Spark can't see Hive schema updates, partly because it stores its own
copy of the schema in the table properties in the Hive metastore.


1. FROM SPARK: create a table
============
>>> spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
>>> spark.table("my_table").printSchema()
root
|-- col1: integer (nullable = true)
|-- col2: integer (nullable = true)


2. FROM HIVE: alter the schema
==========
0: jdbc:hive2://localhost:10000> ALTER TABLE my_table REPLACE
COLUMNS(`col1` int, `col2` int, `col3` string);
0: jdbc:hive2://localhost:10000> describe my_table;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| col1 | int | |
| col2 | int | |
| col3 | string | |
+-----------+------------+----------+


3. FROM SPARK: problem, column does not appear
==============
>>> spark.table("my_table").printSchema()
root
|-- col1: integer (nullable = true)
|-- col2: integer (nullable = true)


4. FROM METASTORE DB: two ways of storing the columns
======================
metastore=# select * from "COLUMNS_V2";
CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
-------+---------+-------------+-----------+-------------
2 | | col1 | int | 0
2 | | col2 | int | 1
2 | | col3 | string | 2


metastore=# select * from "TABLE_PARAMS";
 TBL_ID | PARAM_KEY                         | PARAM_VALUE
--------+-----------------------------------+-------------
      1 | spark.sql.sources.provider        | parquet
      1 | spark.sql.sources.schema.part.0   | {"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}
      1 | spark.sql.create.version          | 2.4.8
      1 | spark.sql.sources.schema.numParts | 1
      1 | last_modified_time                | 1641483180
      1 | transient_lastDdlTime             | 1641483180
      1 | last_modified_by                  | anonymous

metastore=# truncate "TABLE_PARAMS";
TRUNCATE TABLE
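
As far as I can tell, Spark rebuilds its schema by concatenating the
numbered spark.sql.sources.schema.part.N values (the schema is split into
parts because metastore parameter values have a length limit) and decoding
the result as JSON, ignoring COLUMNS_V2. A minimal pure-Python sketch of
that reconstruction, using the values from the TABLE_PARAMS dump above
(no Spark needed):

```python
import json

# Table properties as stored in TABLE_PARAMS (copied from the dump above).
params = {
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": (
        '{"type":"struct","fields":['
        '{"name":"col1","type":"integer","nullable":true,"metadata":{}},'
        '{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}'
    ),
}

# Reassemble the JSON schema from its numbered parts, then decode it.
num_parts = int(params["spark.sql.sources.schema.numParts"])
raw = "".join(params[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts))
schema = json.loads(raw)

# Only col1 and col2 come back -- col3 exists in COLUMNS_V2 but not here.
print([f["name"] for f in schema["fields"]])
```

This would explain the behaviour in step 3: the ALTER TABLE from Hive
updated COLUMNS_V2 but not these properties, so Spark keeps seeing the
old schema.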


5. FROM SPARK: now the column magically appears
==============
>>> spark.table("my_table").printSchema()
root
|-- col1: integer (nullable = true)
|-- col2: integer (nullable = true)
|-- col3: string (nullable = true)


So, is it really necessary to store that schema copy in TABLE_PARAMS?
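
In the meantime, a less destructive workaround than truncating the whole
TABLE_PARAMS table (which wipes the properties of every table) might be
to unset only the Spark schema keys for the affected table from Hive. I
have not verified this on my setup, but Hive's standard UNSET
TBLPROPERTIES syntax should do it:

```sql
ALTER TABLE my_table UNSET TBLPROPERTIES
  ('spark.sql.sources.schema.numParts', 'spark.sql.sources.schema.part.0');
```

After that, Spark should fall back to the column definitions in
COLUMNS_V2, as in step 5.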

