-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27404/#review59546
-----------------------------------------------------------
This is looking good overall, but I have a few comments below about how it works. I'd also like to see unit tests for the writeArray and, if added, writeMap methods. That would be a good way to demonstrate the ArrayWritable-of-ArrayWritables structure this relies on, and it would also make it easy to add more test cases.


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment100810>

    I think this patch would benefit from some slight refactoring. The startGroup/endGroup calls have been changed so that each key-value pair or array element has its own startGroup/endGroup call. That's needed, but it seems awkward that this makes the assumptions of writeData and writeArray inconsistent: writeData expects startGroup and endGroup to have been called already, while writeArray does not.

    Another problem is that this still doesn't check for the LIST and MAP annotations. I think it would be much better to make this match the (draft) spec more closely, which would clean up the problem above. By looking for LIST or MAP, the call to writeArray could be moved to the level that wraps the repeated group. Then the startGroup/endGroup calls could be consistent.

    This would also allow you to do more validation. First, you could check that the group contained by a LIST or MAP annotation actually is a repeated group, and you could check whether LIST or MAP annotates the repeated group itself and adjust accordingly.


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment100812>

    It would be much easier to read and understand this section if the names were more descriptive than subValue and subType. There are also validations that should be done at this level, like checking that the key-value type in a map is named "key_value", that the key is required, etc.
It would probably be worth breaking writeArray and writeMap into separate methods that make assumptions about the structure of the type, validate those assumptions, and then write out the values with more understandable code.


- Ryan Blue


On Oct. 30, 2014, 4:43 p.m., Sergio Pena wrote:

> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27404/
> -----------------------------------------------------------
>
> (Updated Oct. 30, 2014, 4:43 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-8359
>     https://issues.apache.org/jira/browse/HIVE-8359
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch changes the way the DataWritableWriter class writes an array of
> elements to the Parquet record. It wraps each array record in a new
> startGroup/endGroup block so that Parquet can detect null values on those
> optional fields.
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c7078efe27482df0a11dd68ac068da27dbcf51b3
>   ql/src/test/queries/clientpositive/parquet_map_null.q PRE-CREATION
>   ql/src/test/results/clientpositive/parquet_map_null.q.out PRE-CREATION
>
> Diff: https://reviews.apache.org/r/27404/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>