[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777446#comment-16777446 ]
BELUGA BEHR commented on HIVE-21240: ------------------------------------ [~kgyrtkirk] OK. I finally figured it out. cc: [~bslim] There are a few things going on: The Kafka Storage Handler q-test is incorrect. There are several tables that define a {{timestamp}} or a {{timestamp with local time zone}} column. And for every q-test result, the q-test expects a {{NULL}} value in the column. However, I do believe this is incorrect and therefore this q-test is testing for a wrong behavior. There is timestamp data in the test data set and there should be values present, not {{NULL}} values. I believe that the current JSON SerDe is unable to handle the timestamp Strings in the test data set and instead of throwing an Exception, it swallows the bad value and returns a null value. Check it out [here|https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1298]. I do not think this is a good behavior. Users may silently lose data and there is no way to configure it otherwise. However, even if it was configurable, the default should be to throw an Exception and to stop the processing, not to lose data. Additionally, there is a q-test that states: {quote}using basic implementation of flat json probably to be removed {quote} This is achieved by not specifying an explicit table SerDe in the q-test. The 'basic implementation' is probably in reference to the class {{KafkaJsonSerDe}}. However, this class is not used anymore as far I can tell and is no longer the default SerDe for the Kafka Storage Handler. The Hive {{JsonSerde}} is [the default serde|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaTableProperties.java#L38]. One thing you'll notice with this particular q-test is that there is also no {{timestamp.formats}} defined for the table. This is because {{KafkaJsonSerDe}} handles [ISO-8601 format|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaJsonSerDe.java#L230] (and only that format) so it does not need to explicitly specify the format. All of the data in this test are formatted as ISO-8601 and if you look at all the other tables in the test, they all pass in an ISO-8601 format string. It is required to pass this format string because the Hive {{JsonSerde}} does not handle that format by default. Without the {{timestamp.formats}} defined on this table, there is no way that the current {{JsonSerde}} is handling this data correctly. It again demonstrates that the current {{JsonSerde}} behavior is to swallow the exception, and return NULL. [https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/kafka_storage_handler.q] [https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/kafka/kafka_storage_handler.q.out] What triggered these q-test failures for this jira is three fold: # This JIRA's implementation was in some cases producing valid timestamps instead of NULL # This JIRA's implementation was throwing an exception because it was unable to parse the timestamp String without the format string defined # This implementation was in some cases throwing an exception because it did not support {{timestamp with local time zone}}. Moving forward, I would like to: # Add to this implementation the ability to process {{timestamp with local time zone}}. This is currently not fully supported in the current {{JsonSerde}} implementation. Only [timestamp|https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L355] is supported. It works in trunk because there's a conversion process that happens immediately prior to this _switch_ statement that is doing the work. # Update the {{kafka_storage_handler}} q-test to check for the correct timestamp values # Remove {{KafkaJsonSerDe}} serde # Remove the "basic implementation of flat json" test from the q-test > JSON SerDe Re-Write > ------------------- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers > Affects Versions: 4.0.0, 3.1.1 > Reporter: BELUGA BEHR > Assignee: BELUGA BEHR > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)