[ https://issues.apache.org/jira/browse/HIVE-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xsys updated HIVE-26533:
------------------------
Description: 
h3. Describe the bug

We are trying to store a table through the {{spark-sql}} interface with the {{Avro}} file format. The table's schema contains a column with the {{BYTE}} data type, and the column's name contains an uppercase letter. When we {{INSERT}} a valid value (e.g. {{-128}}), we see the following warning:
{code:java}
WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.{code}
When we then perform a {{DESC}} on the table, we observe that the {{BYTE}} data type has been converted to {{int}} and the case of the column name has been lost (it is converted to lowercase).

h3. Steps to reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), start {{spark-sql}} with the Avro package:
{code:java}
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
Execute the following:
{code:java}
spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.359 seconds
spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
Time taken: 1.605 seconds
spark-sql> desc hive_tinyint_avro;
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
c0	int
c1	int	// Data type and case-sensitivity lost
Time taken: 0.068 seconds, Fetched 2 row(s){code}
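To see exactly what ended up on disk, it can help to dump the writer schema embedded in one of the table's Avro data files. The helper below is a hypothetical sketch using Avro's Java API (the class name and the file-path argument are our own; point it at any {{.avro}} file under the table's warehouse directory):
{code:java}
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper (not part of the report): prints the writer schema
// embedded in an Avro container file.
public class PrintAvroWriterSchema {
  public static void main(String[] args) throws Exception {
    // args[0]: path to one of the table's .avro data files (assumption).
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(new File(args[0]), new GenericDatumReader<>())) {
      // For the table above, we expect both fields to appear as "int",
      // with the second field's name already lower-cased to "c1".
      System.out.println(reader.getSchema().toString(true));
    }
  }
}
{code}
The same check can be done with {{avro-tools getschema}} against the data file.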
h3. Expected behavior

We expect both the case of the column name and the data type to be preserved. Other formats such as Parquet and ORC behave consistently with this expectation. As the warning above suggests, Spark falls back to the Hive metastore schema only when it disagrees with the schema recorded when the table was created by Spark SQL; with Parquet the two agree, so nothing is lost.

Here are the logs from the same steps with Parquet:
{noformat}
spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
Time taken: 0.134 seconds
spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
Time taken: 0.995 seconds
spark-sql> desc hive_tinyint_parquet;
c0	int
C1	tinyint	// Data type and case-sensitivity preserved
Time taken: 0.092 seconds, Fetched 2 row(s){noformat}

h3. Root Cause

[TypeInfoToSchema|https://github.com/apache/hive/blob/8190d2be7b7165effa62bd21b7d60ef81fb0e4af/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L41]'s [createAvroPrimitive|https://github.com/apache/hive/blob/rel/release-3.1.2/serde/src/java/org/apache/hadoop/hive/serde2/avro/TypeInfoToSchema.java#L124-L132] is where Hive's BYTE, SHORT, and INT are all mapped to Avro's INT:
{code:java}
case BYTE:
  schema = Schema.create(Schema.Type.INT);
  break;
case SHORT:
  schema = Schema.create(Schema.Type.INT);
  break;
case INT:
  schema = Schema.create(Schema.Type.INT);
  break;
{code}
Avro has no primitive integer type narrower than INT, so once the Hive schema has been converted to an Avro schema, the actual Hive schema specified by the user is no longer recoverable: after TINYINT/BYTE is mapped to INT, the former is lost in the AvroSerde instance.
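Avro's Java API allows arbitrary string properties on schemas, and the serde already relies on such annotations for other mappings (e.g. the decimal mapping annotates its {{bytes}} schema with {{logicalType}}, {{precision}}, and {{scale}}). Below is a minimal sketch of one *possible* direction for preserving the original type, assuming a hypothetical {{hiveType}} property name; it is an illustration, not the actual fix:
{code:java}
import org.apache.avro.Schema;

// Sketch only, NOT the actual Hive implementation: "hiveType" is a
// hypothetical property name chosen for illustration.
public class ByteToAvroSketch {
  static final String HIVE_TYPE_PROP = "hiveType"; // assumption

  static Schema createAvroIntFor(String hiveTypeName) {
    Schema schema = Schema.create(Schema.Type.INT);
    // Avro schemas accept arbitrary string properties; a reader that
    // understands this one could restore tinyint/smallint instead of int.
    schema.addProp(HIVE_TYPE_PROP, hiveTypeName);
    return schema;
  }

  public static void main(String[] args) {
    // Prints: {"type":"int","hiveType":"tinyint"}
    System.out.println(createAvroIntFor("tinyint"));
  }
}
{code}
For the round trip to work, the reverse mapping (SchemaToTypeInfo) would presumably also need to honor such a property when reconstructing the Hive schema.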
> Column data type is lost when an Avro table with a BYTE column is written
> through spark-sql
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26533
>                 URL: https://issues.apache.org/jira/browse/HIVE-26533
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.2
>            Reporter: xsys
>            Priority: Major