[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767202#comment-16767202 ]
BELUGA BEHR commented on HIVE-21240:
------------------------------------

[~kgyrtkirk] Thank you for the review!!

# I think that it's best to let the JSON library, which is specialized in parsing, do the work. It certainly cuts down on code that Hive needs to maintain. Yes, the tree has to be built upfront, but as things currently stand, there are very few scenarios where the entire tree isn't traversed. I guess that if there is an "unknown" field in the text, the field is ignored and the JSON string value is not parsed into its data type as things currently stand. I don't know that this is a scenario that needs to be optimized for. I have not seen many situations where a customer actively wants to ignore a field.
# I think it's best to work with Java {{Collections}} over Java native arrays. The only thing that happens with the results of the SerDe is that they are iterated over. The passing qtests bear this out, so I think that 4.0 is a good time to make that change.
# I am sorry about the reformatted lines. Since I was touching a lot of code in the JsonSerde, I thought it might be helpful to clean up some checkstyle issues while I was in there. I will revert. Thank you for pointing me at the Hive formatter; I have been using the Hadoop formatter for over a year.

Thanks again!

> JSON SerDe Deserialize Re-Write
> -------------------------------
>
>                 Key: HIVE-21240
>                 URL: https://issues.apache.org/jira/browse/HIVE-21240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 4.0.0, 3.1.1
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, HIVE-21240.2.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for each row processed, for each column in the row

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
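
For readers following the issue, here is a minimal sketch of three of the ideas listed in the description: a column-name to column-index cache, rows returned as a Java collection rather than a native array, and base-64 decoding for binary values. It is not taken from the HIVE-21240 patch. The class {{ColumnIndexCache}} and its methods are hypothetical names, it assumes case-insensitive column matching, and it uses only the JDK so that it runs without Jackson or Hive on the classpath.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not the code from the HIVE-21240 patch. */
final class ColumnIndexCache {
  private final Map<String, Integer> indexByName = new HashMap<>();
  private final int columnCount;

  ColumnIndexCache(List<String> columnNames) {
    this.columnCount = columnNames.size();
    for (int i = 0; i < columnCount; i++) {
      // Assumption: names match case-insensitively; the first duplicate wins.
      indexByName.putIfAbsent(columnNames.get(i).toLowerCase(Locale.ROOT), i);
    }
  }

  /** Hash lookup instead of scanning the column list for every field of every row. */
  int indexOf(String fieldName) {
    Integer idx = indexByName.get(fieldName.toLowerCase(Locale.ROOT));
    return idx == null ? -1 : idx;
  }

  /**
   * Builds a row as a List. Unknown fields are ignored, and columns with no
   * matching field stay null, so an empty input yields a row of all nulls.
   */
  List<Object> toRow(Map<String, ?> fields) {
    List<Object> row = new ArrayList<>(Collections.<Object>nCopies(columnCount, null));
    for (Map.Entry<String, ?> e : fields.entrySet()) {
      int idx = indexOf(e.getKey());
      if (idx >= 0) {
        row.set(idx, e.getValue());
      }
    }
    return row;
  }

  /** Decodes a base-64 JSON string value into raw bytes for a binary column. */
  static byte[] decodeBase64(String text) {
    return Base64.getDecoder().decode(text);
  }
}
```

In this sketch the map is built once per table schema and reused for every row, so each field lookup is a hash probe rather than a scan over the column names. Passing an empty map to {{toRow}} models the blank-line case from the description, where every column comes back as null.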