[ https://issues.apache.org/jira/browse/HIVE-28632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Araika Singh updated HIVE-28632: -------------------------------- Description: *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted boolean strings without considering case sensitivity, resulting in {{false}} for any non-standard strings. Now, this behavior is case-sensitive, resulting in {{false}} for all non-conforming strings. To follow SQL standards, any string other than "false" (case-insensitive) should be treated as {{{}true{}}}. There are also inconsistencies between outputs when using an INSERT statement (which follows SQL standards) and loading a JSON value directly. These inconsistencies need to be fixed to ensure uniform output. *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings, allowing users to directly load JSON text to achieve desired results. Post HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for any other format. It's important to note that INSERT statements work seamlessly as the values are serialized to base64 and then deserialized. Hive should be capable of detecting if a JSON node is base64-encoded or not; if not, it should default to deserializing as a raw string. +To enhance user experience and maintain SQL standard compliance, we need to address the following:+ # Ensure case-sensitive handling of boolean strings in JSON deserialization and resolve inconsistencies between INSERT statements and direct JSON loading. # Implement automatic detection and appropriate deserialization of JSON nodes, distinguishing between base64-encoded and raw strings. Steps to replicate binary handling issue: {code:java} create table json_serde_table ( binarycolumn1 binary, binarycolumn2 binary, binarycolumn3 binary, binarycolumn4 binary, binarycolumn5 binary, binarycolumn6 binary ) row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'; LOAD DATA INPATH '/path/to/jsonserde.txt' INTO TABLE json_serde_table; select * from json_serde_table; {code} {noformat} Error stack trace: Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary type from record. java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary type from record. at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535) at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194) at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336) at java.base/java.security.AccessController.doPrivileged(AccessController.java:712) at java.base/javax.security.auth.Subject.doAs(Subject.java:439) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary type from record. at org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:191) at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:609) ... 18 more Caused by: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary type from record. at org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:464) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitLeafNode(HiveJsonReader.java:418) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:224) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitStructNode(HiveJsonReader.java:335) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:229) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.parseStruct(HiveJsonReader.java:206) at org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:187) ... 19 more Caused by: org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException: Cannot access contents of TextNode as binary due to broken Base64 encoding: Unexpected end of base64-encoded String: base64 variant 'MIME-NO-LINEFEEDS' expects padding (one or more '=' characters) at the end. This Base64Variant might have been incorrectly configured at org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:67) at org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.getBinaryValue(TextNode.java:78) at org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.binaryValue(TextNode.java:89) at org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:458) ... 25 more {noformat} was: *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted boolean strings without considering case sensitivity, resulting in {{false}} for any non-standard strings. Now, this behavior is case-sensitive, resulting in {{false}} for all non-conforming strings. To follow SQL standards, any string other than "false" (case-insensitive) should be treated as {{{}true{}}}. There are also inconsistencies between outputs when using an INSERT statement (which follows SQL standards) and loading a JSON value directly. These inconsistencies need to be fixed to ensure uniform output. *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings, allowing users to directly load JSON text to achieve desired results. Post HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for any other format. It's important to note that INSERT statements work seamlessly as the values are serialized to base64 and then deserialized. Hive should be capable of detecting if a JSON node is base64-encoded or not; if not, it should default to deserializing as a raw string. +To enhance user experience and maintain SQL standard compliance, we need to address the following:+ # Ensure case-sensitive handling of boolean strings in JSON deserialization and resolve inconsistencies between INSERT statements and direct JSON loading. # Implement automatic detection and appropriate deserialization of JSON nodes, distinguishing between base64-encoded and raw strings. > Fix issues in JSON SerDe implementations related to Boolean, Binary data types > ------------------------------------------------------------------------------ > > Key: HIVE-28632 > URL: https://issues.apache.org/jira/browse/HIVE-28632 > Project: Hive > Issue Type: Bug > Security Level: Public(Viewable by anyone) > Reporter: Araika Singh > Assignee: Araika Singh > Priority: Major > Labels: pull-request-available > > *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted > boolean strings without considering case sensitivity, resulting in {{false}} > for any non-standard strings. Now, this behavior is case-sensitive, resulting > in {{false}} for all non-conforming strings. To follow SQL standards, any > string other than "false" (case-insensitive) should be treated as > {{{}true{}}}. There are also inconsistencies between outputs when using an > INSERT statement (which follows SQL standards) and loading a JSON value > directly. These inconsistencies need to be fixed to ensure uniform output. > *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings, > allowing users to directly load JSON text to achieve desired results. Post > HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for > any other format. It's important to note that INSERT statements work > seamlessly as the values are serialized to base64 and then deserialized. Hive > should be capable of detecting if a JSON node is base64-encoded or not; if > not, it should default to deserializing as a raw string. > +To enhance user experience and maintain SQL standard compliance, we need to > address the following:+ > # Ensure case-sensitive handling of boolean strings in JSON deserialization > and resolve inconsistencies between INSERT statements and direct JSON loading. > # Implement automatic detection and appropriate deserialization of JSON > nodes, distinguishing between base64-encoded and raw strings. > Steps to replicate binary handling issue: > {code:java} > create table json_serde_table ( > binarycolumn1 binary, > binarycolumn2 binary, > binarycolumn3 binary, > binarycolumn4 binary, > binarycolumn5 binary, > binarycolumn6 binary > ) > row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'; > LOAD DATA INPATH '/path/to/jsonserde.txt' INTO TABLE json_serde_table; > select * from json_serde_table; {code} > {noformat} > Error stack trace: > Failed with exception > java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type from record. java.io.IOException: > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type from record. at > org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628) > at > org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535) > at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194) > at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95) at > org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212) at > org.apache.hadoop.hive.ql.Driver.run(Driver.java:154) at > org.apache.hadoop.hive.ql.Driver.run(Driver.java:149) at > org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) at > org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236) > at > org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336) > at > java.base/java.security.AccessController.doPrivileged(AccessController.java:712) > at java.base/javax.security.auth.Subject.doAs(Subject.java:439) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:840) Caused by: > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type from record. at > org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:191) at > org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:609) > ... 18 more Caused by: org.apache.hadoop.hive.serde2.SerDeException: Error > generating JSON binary type from record. at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:464) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitLeafNode(HiveJsonReader.java:418) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:224) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitStructNode(HiveJsonReader.java:335) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:229) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.parseStruct(HiveJsonReader.java:206) > at org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:187) > ... 19 more Caused by: > org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException: > Cannot access contents of TextNode as binary due to broken Base64 encoding: > Unexpected end of base64-encoded String: base64 variant 'MIME-NO-LINEFEEDS' > expects padding (one or more '=' characters) at the end. This Base64Variant > might have been incorrectly configured at > org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:67) > at > org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.getBinaryValue(TextNode.java:78) > at > org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.binaryValue(TextNode.java:89) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:458) > ... 25 more > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)