[ 
https://issues.apache.org/jira/browse/HIVE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046043#comment-13046043
 ] 

Krishna Kumar commented on HIVE-956:
------------------------------------

Re-ordering my responses;

bq. can LazyBinaryColumnarSerDe share some code with LazyBinarySerde?

        Yes, it does. For instance, the serialization of 
LazyBinaryColumnarSerde forwards to LazyBinarySerde.serialize for each field 
(except for one special case of empty string described below). Similarly the 
deserialization happens in both cases eventually via LazyBinaryXXX.init. The 
objectinspector is the same, while the common parts of the object class 
(ColumnarStruct and BinaryColumnarStruct) has been refactored into a 
ColumnarStructBase class.
        
bq. how warnedOnceNullMapKey is used?

        To enable the above (reuse of LazyBinarySerde.serialize()), I have made 
it a static method of LazyBinarySerde. The only member variable which was used 
in that method was a boolean which was used to issue a LOG warning the first 
time a null map key is encountered. So, I made that into a parameter so that 
the existing behaviour for LazyBinarySerde is unchanged (that is a warning is 
issued once per instance).
        
bq. why do you need to handle empty string specially? 
"serializeStream.write(INVALID_UTF__SINGLE_BYTE, 0, 1);" i thought for empty 
data, we just store data length 0 in rcfile. 
bq. I thought since there is no NULLSequence in the new serde, the null should 
be handled specially.  i am missing sth, How do you handle null here?    

        As you know, column values' length are stored in the key part of rcfile 
(after run-length encoding, and an optional compressing). A 0 in this recorded 
length is used as the null indicator. This means that non-null values should 
occupy one or more bytes when serialized. That was ok with the original 
LazyBinarySerde.serialize, as primitive numeric types, strings (with their 
datalength) and complex types (with their datalength) do occupy non-zero bytes. 
But this is too much redundancy and overhead for the typical case (non-empty 
strings), so I added an extra parameter "skipLengthPrefix" which skips 
prefixing string/list/map/struct types with a length prefix. But with this, 
empty strings become a problem since they need to differentiated from nulls. So 
I used this special single-byte marker for denoting empty strings. (As a side 
note, for completeness' sake, I should point out that an instance of a struct 
which has no fields will be encoded with zero bytes. But this is not allowed by 
the language so I think we are fine here.)

bq. is getLength() only for null check? if yes, can you call it 'isNull()'? And 
if the only difference in ColumnarStruct and BinaryColumnarStruct is null 
check, just curious, how difficult is it to avoid this new BinaryColumnarStruct 
class?

        See above. In general, the length recorded in the key part of rcfile 
reflects the length of the bytesequence with which the lazyobject should be 
initialized. The only exception is in the case of empty strings, where the 
recorded length is 1 (the special marker), but the lazyobject needs be 
initialized with a 0-length byte sequence.   
        
        Recorded length being 0 indicates nulls for lazybinarycolumnar and data 
being the nullsequence indicates null for lazybinary. The difference between 
ColumnarStruct and BinaryColumnarStruct is this length/null handling, and the 
object creation itself, which are now the abstract methods of the common base 
class.

bq. originally the map comparison is not supported, but this patch added a 
mapEqualComparer. can we put this in a separate patch? It seems the logic in 
CrossMapEqualComparer is not correct. (how do you make sure you will get the 
keys from a map in some kind of same order?)

        I put this in this patch since I needed that for the tests that I had 
added. Do you think I should create a dependant jira and extract this part of 
the patch to that jira?
        
        Hmm, the logic in crossmapequalcomparer looks ok to me (given the 
caveats mentioned in the javadoc about broken transitivity of 
greater-than/less-than.) I am not accessing the keys in tandem, but in a nested 
loop. Since the number of keys are the same, and the keys are unique, both keys 
and values matching (as declared by ObjectInspectorUtils.compare) is taken as a 
match for that pair of key-value pairs.
        
bq. can you add a 'toString()' for new binary columnar serde, just the same as 
columnar serde

        Done, and patch regenerated after rebasing.

> Add support of columnar binary serde
> ------------------------------------
>
>                 Key: HIVE-956
>                 URL: https://issues.apache.org/jira/browse/HIVE-956
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Krishna Kumar
>         Attachments: HIVE.956.patch.0, HIVE.956.patch.1, HIVE.956.patch.2
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to