[ https://issues.apache.org/jira/browse/HIVE-28632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901720#comment-17901720 ]
Denys Kuzmenko commented on HIVE-28632: --------------------------------------- Merged to master Thanks for the patch [~araika]! > Fix issues in JSON SerDe implementations related to Boolean, Binary data types > ------------------------------------------------------------------------------ > > Key: HIVE-28632 > URL: https://issues.apache.org/jira/browse/HIVE-28632 > Project: Hive > Issue Type: Bug > Security Level: Public(Viewable by anyone) > Reporter: Araika Singh > Assignee: Araika Singh > Priority: Major > Labels: pull-request-available > > *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted > boolean strings without considering case sensitivity, resulting in {{false}} > for any non-standard strings. Now, this behavior is case-sensitive, resulting > in {{false}} for all non-conforming strings. To follow SQL standards, any > string other than "false" (case-insensitive) should be treated as > {{{}true{}}}. There are also inconsistencies between outputs when using an > INSERT statement (which follows SQL standards) and loading a JSON value > directly. These inconsistencies need to be fixed to ensure uniform output. > *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings, > allowing users to directly load JSON text to achieve desired results. Post > HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for > any other format. It's important to note that INSERT statements work > seamlessly as the values are serialized to base64 and then deserialized. Hive > should be capable of detecting if a JSON node is base64-encoded or not; if > not, it should default to deserializing as a raw string. > *+To enhance user experience and maintain SQL standard compliance, we need to > address the following:+* > # Ensure case-sensitive handling of boolean strings in JSON deserialization > and resolve inconsistencies between INSERT statements and direct JSON loading. > # Implement automatic detection and appropriate deserialization of JSON > nodes, distinguishing between base64-encoded and raw strings. > *+Steps to replicate boolean handling issue:+* > {code:java} > // values for json: "booleancaseinsensitive" : "TrUE", "booleanstring" : > "true", "booleanboolean" : true, "stringfalse" : "FaLSE", "somestring" : > "somestringhere", "booleannull" : null, "booleannumfalse" : 0, > "booleannumtrue" : -1 > create table json_serde_bool_table ( > booleancaseinsensitive boolean, > booleanstring boolean, > booleanboolean boolean, > stringfalse boolean, > somestring boolean, > booleannull boolean, > booleannumfalse boolean, > booleannumtrue boolean) > row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'; > LOAD DATA INPATH '/warehouse/files/jsonserde.txt' INTO TABLE > json_serde_bool_table; > select * from json_serde_bool_table;{code} > +_Output for the above:_+ > {noformat} > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+ > | booleancaseinsensitive | booleanstring | booleanboolean | stringfalse | > somestring | booleannull | booleannumfalse | booleannumtrue | > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+ > | false | true | true | false | > false | NULL | false | true | > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat} > {code:java} > truncate table json_serde_bool_table; > INSERT INTO TABLE json_serde_bool_table VALUES ("TrUE", "true", true, > "FaLSE", "somestringhere", null, 0, -1); > select * from json_serde_bool_table;{code} > +_Output for the above:_+ > {noformat} > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+ > | booleancaseinsensitive | booleanstring | booleanboolean | stringfalse | > somestring | booleannull | booleannumfalse | booleannumtrue | > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+ > | true | true | true | false | > true | NULL | false | true | > +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat} > *+Steps to replicate binary handling issue:+* > {code:java} > create table json_serde_table ( > binarycolumn1 binary, > binarycolumn2 binary, > binarycolumn3 binary, > binarycolumn4 binary, > binarycolumn5 binary, > binarycolumn6 binary > ) > row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'; > LOAD DATA INPATH '/path/to/jsonserde.txt' INTO TABLE json_serde_table; // the > concerned json text show be normal text(not base64) > select * from json_serde_table; {code} > _+Error stack trace:+_ > {code:java} > Failed with exception > java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type > from record. > java.io.IOException: > org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type > from record. > at > org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628) > at > org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535) > at > org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194) > at > org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149) > at > org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185) > at > org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236) > at > org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336) > at > java.base/java.security.AccessController.doPrivileged(AccessController.java:712) > at java.base/javax.security.auth.Subject.doAs(Subject.java:439) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) > at > java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:840) > Caused > by: org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type > from record. > at > org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:191) > at > org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:609) > ... 18 more > Caused > by: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary > type from record. > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:464) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitLeafNode(HiveJsonReader.java:418) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:224) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitStructNode(HiveJsonReader.java:335) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:229) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.parseStruct(HiveJsonReader.java:206) > at > org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:187) > ... 19 more > Caused > by: org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException: > Cannot access contents of TextNode as binary due to broken Base64 encoding: > Unexpected end of base64-encoded String: base64 variant 'MIME-NO-LINEFEEDS' > expects padding (one or more '=' characters) at the end. This Base64Variant > might have been incorrectly configured > at > org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:67) > at > org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.getBinaryValue(TextNode.java:78) > at > org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.binaryValue(TextNode.java:89) > at > org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:458) > ... 25 more > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)