-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27404/#review60863
-----------------------------------------------------------
Great job! I think this is really close. I found one bug and some minor stuff, all noted below. Other than those, I have two major comments.

First, the organization duplicates code from writeFields in writeMap and writeArray, where each field of the inner group is written; writeFields could be called there instead. This also deviates from what the other writer implementations do, which is to require the schema to match a certain pattern and write exactly that pattern. See [1] for an example of what I'm talking about. But I think this was actually structured this way to work with multiple write schemas...

The second problem is that this works with multiple write schemas. This was probably a mis-communication on my part, but the backward-compatibility rules in PARQUET-113 are for the read side, not the write side. Hive already produces the correct structure when writing lists and maps, so we don't need to change that. The problem PARQUET-113 addresses is reading data written by parquet-avro or parquet-thrift (or similar variants) that uses the 2-level representation. We still need to fix compatibility with that data, but it should probably be done as a separate issue.

[1] - https://github.com/rdblue/incubator-parquet-mr/blob/PARQUET-113-nested-types-avro/parquet-avro/src/main/java/parquet/avro/AvroWriteSupport.java#L118


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment102276>

Minor: The size of map_values should be accessed with map_values.length rather than value.get().length.

Nit: I'm not sure what the variable-naming conventions are for this project, but Java projects usually prefer camelCase, as in mapValues and mapKeyValue.


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment102277>

When would key_pair_value be null?
If this is something Hive needs to support, then we should note it with a comment. Otherwise, violating this assumption should be an exception, because Parquet can't handle a null value here. This differs slightly from the behavior for LIST, where null values are encoded at this level. For a map, however, the key-value pair is required and should always be non-null; only the map value, which is one level inside, can be null.


ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment102267>

This doesn't appear to be used in the test. Was this extra group in the record intended to test that Hive ignores extra columns? If so, can you note it in a comment?


ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/27404/#comment102291>

This test should also verify that null array values are supported by adding a createNull() call between 1 and 2. I tried this, and the test failed because the null check calls continue, which skips to the next iteration in writeArray and never calls endGroup(). Adding endGroup() there fixes it; good news!


- Ryan Blue


On Nov. 10, 2014, 11:04 a.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27404/
> -----------------------------------------------------------
> 
> (Updated Nov. 10, 2014, 11:04 a.m.)
> 
> 
> Review request for hive, Ryan Blue and Brock Noland.
> 
> 
> Bugs: HIVE-8359
>     https://issues.apache.org/jira/browse/HIVE-8359
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The patch changes the way the DataWritableWriter class writes an array of elements to the Parquet record. It wraps each array element in a new startGroup/endGroup block so that Parquet can detect null values in those optional fields.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c7078efe27482df0a11dd68ac068da27dbcf51b3 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_map_null.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_map_null.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27404/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 
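
To make the writeFields reuse suggestion in the review concrete, here is a minimal, self-contained sketch. All names here are illustrative stand-ins, not the actual DataWritableWriter API, and the string output stands in for calls on Parquet's record consumer. The point is only the shape of the refactoring: writeArray delegates each element group's fields to one shared writeFields routine instead of duplicating the per-field loop.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SharedFieldsSketch {

    // One shared routine that writes every field of a group once,
    // standing in for a DataWritableWriter.writeFields-style helper.
    static void writeFields(Map<String, Object> group, StringBuilder out) {
        for (Map.Entry<String, Object> field : group.entrySet()) {
            out.append(field.getKey()).append('=').append(field.getValue()).append(';');
        }
    }

    // writeArray delegates each element group to writeFields instead of
    // re-implementing the per-field loop (the duplication the review notes).
    static String writeArray(List<Map<String, Object>> elements) {
        StringBuilder out = new StringBuilder();
        for (Map<String, Object> element : elements) {
            out.append('{');            // stands in for startGroup()
            writeFields(element, out);  // no duplicated field-writing code
            out.append('}');            // stands in for endGroup()
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("a", 1);
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("b", 2);
        System.out.println(writeArray(Arrays.asList(first, second)));
        // prints {a=1;}{b=2;}
    }
}
```

The same delegation would apply to writeMap for each key-value group, matching the pattern the AvroWriteSupport example at [1] follows.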
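
The endGroup() bug from the comment102291 discussion can also be shown in isolation. The sketch below is hypothetical: a plain counter stands in for the real parquet RecordConsumer, and the two loops are simplified stand-ins for writeArray. It shows why a `continue` on a null element leaves the repeated group unbalanced, and why always calling endGroup() fixes it.

```java
import java.util.Arrays;
import java.util.List;

public class ArrayNullSketch {

    // Hypothetical stand-in for a record consumer that just balances
    // startGroup/endGroup calls.
    static class GroupBalance {
        int open = 0;
        void startGroup() { open++; }
        void endGroup() { open--; }
    }

    // Buggy shape: `continue` on a null element skips endGroup(),
    // leaving the repeated group open.
    static int writeArrayBuggy(List<Object> values) {
        GroupBalance consumer = new GroupBalance();
        for (Object value : values) {
            consumer.startGroup();
            if (value == null) {
                continue;           // bug: endGroup() never runs
            }
            // ... write the element value here ...
            consumer.endGroup();
        }
        return consumer.open;       // > 0 means unbalanced groups
    }

    // Fixed shape: close the group on every iteration, so a null
    // element is encoded as a group with no value written inside it.
    static int writeArrayFixed(List<Object> values) {
        GroupBalance consumer = new GroupBalance();
        for (Object value : values) {
            consumer.startGroup();
            if (value != null) {
                // ... write the element value here ...
            }
            consumer.endGroup();    // always runs, even for nulls
        }
        return consumer.open;
    }

    public static void main(String[] args) {
        List<Object> withNull = Arrays.asList(1, null, 2);
        System.out.println("buggy unbalanced groups: " + writeArrayBuggy(withNull));
        // prints buggy unbalanced groups: 1
        System.out.println("fixed unbalanced groups: " + writeArrayFixed(withNull));
        // prints fixed unbalanced groups: 0
    }
}
```

This mirrors the createNull() test case suggested above: with the element between 1 and 2 null, the buggy loop leaves one group open, while the fixed loop stays balanced.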