[ 
https://issues.apache.org/jira/browse/HIVE-28632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901720#comment-17901720
 ] 

Denys Kuzmenko commented on HIVE-28632:
---------------------------------------

Merged to master
Thanks for the patch [~araika]!

> Fix issues in JSON SerDe implementations related to Boolean, Binary data types
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-28632
>                 URL: https://issues.apache.org/jira/browse/HIVE-28632
>             Project: Hive
>          Issue Type: Bug
>      Security Level: Public(Viewable by anyone) 
>            Reporter: Araika Singh
>            Assignee: Araika Singh
>            Priority: Major
>              Labels: pull-request-available
>
> *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted 
> boolean strings without considering case sensitivity, resulting in {{false}} 
> for any non-standard strings. Now, this behavior is case-sensitive, resulting 
> in {{false}} for all non-conforming strings. To follow SQL standards, any 
> string other than "false" (case-insensitive) should be treated as 
> {{{}true{}}}. There are also inconsistencies between outputs when using an 
> INSERT statement (which follows SQL standards) and loading a JSON value 
> directly. These inconsistencies need to be fixed to ensure uniform output.
> *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings, 
> allowing users to directly load JSON text to achieve desired results. Post 
> HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for 
> any other format. It's important to note that INSERT statements work 
> seamlessly as the values are serialized to base64 and then deserialized. Hive 
> should be capable of detecting if a JSON node is base64-encoded or not; if 
> not, it should default to deserializing as a raw string.
> *+To enhance user experience and maintain SQL standard compliance, we need to 
> address the following:+*
>  # Ensure case-sensitive handling of boolean strings in JSON deserialization 
> and resolve inconsistencies between INSERT statements and direct JSON loading.
>  # Implement automatic detection and appropriate deserialization of JSON 
> nodes, distinguishing between base64-encoded and raw strings.
> *+Steps to replicate boolean handling issue:+*
> {code:java}
> // values for json: "booleancaseinsensitive" : "TrUE", "booleanstring" : 
> "true", "booleanboolean" : true, "stringfalse" : "FaLSE", "somestring" : 
> "somestringhere", "booleannull" : null, "booleannumfalse" : 0, 
> "booleannumtrue" : -1
> create table json_serde_bool_table (
>     booleancaseinsensitive boolean,
>     booleanstring boolean,
>     booleanboolean boolean,
>     stringfalse boolean,
>     somestring boolean,
>     booleannull boolean,
>     booleannumfalse boolean,
>     booleannumtrue boolean)
>   row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe';
> LOAD DATA INPATH '/warehouse/files/jsonserde.txt' INTO TABLE 
> json_serde_bool_table;
> select * from json_serde_bool_table;{code}
> +_Output for the above:_+
> {noformat}
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | booleancaseinsensitive  | booleanstring  | booleanboolean  | stringfalse  | 
> somestring  | booleannull  | booleannumfalse  | booleannumtrue  |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | false                   | true           | true            | false        | 
> false       | NULL         | false            | true            |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat}
> {code:java}
> truncate table json_serde_bool_table;
> INSERT INTO TABLE json_serde_bool_table VALUES ("TrUE", "true", true, 
> "FaLSE", "somestringhere", null, 0, -1);
> select * from json_serde_bool_table;{code}
> +_Output for the above:_+
> {noformat}
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | booleancaseinsensitive  | booleanstring  | booleanboolean  | stringfalse  | 
> somestring  | booleannull  | booleannumfalse  | booleannumtrue  |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | true                    | true           | true            | false        | 
> true        | NULL         | false            | true            |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat}
> *+Steps to replicate binary handling issue:+*
> {code:java}
> create table json_serde_table (
>   binarycolumn1 binary,
>   binarycolumn2 binary,
>   binarycolumn3 binary,
>   binarycolumn4 binary,
>   binarycolumn5 binary,
>   binarycolumn6 binary
> )
> row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe';
> LOAD DATA INPATH '/path/to/jsonserde.txt' INTO TABLE json_serde_table; // the 
> concerned json text show be normal text(not base64)
> select * from json_serde_table; {code}
> _+Error stack trace:+_
> {code:java}
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary 
> type
> from record.
> java.io.IOException:
> org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary 
> type
> from record.
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535)
> at
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194)
> at
> org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
> at
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336)
> at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> at
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused
> by: org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary 
> type
> from record.
> at
> org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:191)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:609)
> ... 18 more
> Caused
> by: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary
> type from record.
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:464)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitLeafNode(HiveJsonReader.java:418)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:224)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitStructNode(HiveJsonReader.java:335)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:229)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.parseStruct(HiveJsonReader.java:206)
> at
> org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:187)
> ... 19 more
> Caused
> by: org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException:
> Cannot access contents of TextNode as binary due to broken Base64 encoding:
> Unexpected end of base64-encoded String: base64 variant 'MIME-NO-LINEFEEDS'
> expects padding (one or more '=' characters) at the end. This Base64Variant
> might have been incorrectly configured
> at
> org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:67)
> at
> org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.getBinaryValue(TextNode.java:78)
> at
> org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.binaryValue(TextNode.java:89)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:458)
> ... 25 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to