[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777446#comment-16777446
 ] 

BELUGA BEHR commented on HIVE-21240:
------------------------------------

[~kgyrtkirk] OK. I finally figured it out.

cc: [~bslim]

There are a few things going on:

The Kafka Storage Handler q-test is incorrect. There are several tables that 
define a {{timestamp}} or a {{timestamp with local time zone}} column. And for 
every q-test result, the q-test expects a {{NULL}} value in the column. 
However, I do believe this is incorrect and therefore this q-test is testing 
for a wrong behavior. There is timestamp data in the test data set and there 
should be values present, not {{NULL}} values.

I believe that the current JSON SerDe is unable to handle the timestamp Strings 
in the test data set and instead of throwing an Exception, it swallows the bad 
value and returns a null value. Check it out 
[here|https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1298].
 I do not think this is a good behavior. Users may silently lose data and there 
is no way to configure it otherwise. However, even if it was configurable, the 
default should be to throw an Exception and to stop the processing, not to lose 
data.

Additionally, there is a q-test that states:
{quote}using basic implementation of flat json probably to be removed
{quote}
This is achieved by not specifying an explicit table SerDe in the q-test. The 
'basic implementation' is probably in reference to the class 
{{KafkaJsonSerDe}}. However, this class is not used anymore as far I can tell 
and is no longer the default SerDe for the Kafka Storage Handler. The Hive 
{{JsonSerde}} is [the default 
serde|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaTableProperties.java#L38].
 One thing you'll notice with this particular q-test is that there is also no 
{{timestamp.formats}} defined for the table. This is because {{KafkaJsonSerDe}} 
handles [ISO-8601 
format|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaJsonSerDe.java#L230]
 (and only that format) so it does not need to explicitly specify the format.  
All of the data in this test are formatted as ISO-8601 and if you look at all 
the other tables in the test, they all pass in an ISO-8601 format string. It is 
required to pass this format string because the Hive {{JsonSerde}} does not 
handle that format by default. Without the {{timestamp.formats}} defined on 
this table, there is no way that the current {{JsonSerde}} is handling this 
data correctly.  It again demonstrates that the current {{JsonSerde}} behavior 
is to swallow the exception, and return NULL.

[https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/kafka_storage_handler.q]
 
[https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/kafka/kafka_storage_handler.q.out]

 

What triggered these q-test failures for this jira is three fold:
 # This JIRA's implementation was in some cases producing valid timestamps 
instead of NULL
 # This JIRA's implementation was throwing an exception because it was unable 
to parse the timestamp String without the format string defined
 # This implementation was in some cases throwing an exception because it did 
not support {{timestamp with local time zone}}.

 
 Moving forward, I would like to:
 # Add to this implementation the ability to process {{timestamp with local 
time zone}}. This is currently not fully supported in the current {{JsonSerde}} 
implementation. Only 
[timestamp|https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L355]
 is supported.  It works in trunk because there's a conversion process that 
happens immediately prior to this _switch_ statement that is doing the work.
 # Update the {{kafka_storage_handler}} q-test to check for the correct 
timestamp values
 # Remove {{KafkaJsonSerDe}} serde
 # Remove the "basic implementation of flat json" test from the q-test

> JSON SerDe Re-Write
> -------------------
>
>                 Key: HIVE-21240
>                 URL: https://issues.apache.org/jira/browse/HIVE-21240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 4.0.0, 3.1.1
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to