Re: Review Request 32499: HIVE-10086: Hive throws error when accessing Parquet file schema using field name match

Ryan Blue Thu, 26 Mar 2015 11:17:59 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32499/#review77918
-----------------------------------------------------------


Ship it!


I found a few minor things, but this all looks correct to me. Nice job!


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126257>

    Minor: I initially wondered why you didn't use `GroupType#getType(String 
fieldName)` but I see you're using `equalsIgnoreCase`. This would be more clear 
if you used a helper method, `getFieldIgnoreCase(String fieldName)`.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126258>

    Rather than cast, use `fieldType.asGroupType()` because it will throw a 
consistent error message.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126256>

    This is a great java trick I didn't know about to get an each-with-index 
method!
    
    Minor: Is there a better name than `i$` though? Typically names with 
dollar-signs signal generated code to me. What about `iterWithIndex` or 
something a bit more readable?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126249>

    I recommend not using the Type constructors because I'd like to deprecate 
them in the near future. This could be:
    
    ```java
    schemaTypes.add(Types.optional(BINARY).named(colName));
    ```
    
    There are other places this could be done, too, but it's up to you whether 
to file a follow-up or fix it now.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126250>

    Not a blocker, but I'd really like to know what this branching indicates 
about the table. Missing DDL?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126253>

    This gets the columns without changing the order, and the selected columns 
are the first N where N is the size of the list of names. So the only effect of 
this line is to shorten the schema to just what is defined in the table? In 
that case is it necessary to do this or can we just pass the table schema to 
the projection call later? Assuming the projected ids are always `< 
columnNamesList.size()` then it should do the same thing.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
<https://reviews.apache.org/r/32499/#comment126251>

    This isn't a blocker, but I find it odd that the "HIVE_TABLE_SCHEMA" isn't 
a Hive schema. It's a Parquet schema. It might be too late to rename the 
constant's value, but renaming the variable might help readability.



ql/src/test/queries/clientpositive/parquet_table_with_subschema.q
<https://reviews.apache.org/r/32499/#comment126261>

    Could you reverse the order of zip and Street? I want to make sure 
reordering is specifically tested.


- Ryan Blue


On March 25, 2015, 3:42 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32499/
> -----------------------------------------------------------
> 
> (Updated March 25, 2015, 3:42 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-10086
>     https://issues.apache.org/jira/browse/HIVE-10086
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Attached is the patch that handles schema that do not match between Parquet 
> and Hive.
> 
> The access to Parquet data is with name matching in this case. The table 
> column may have different schema order, but if the name matches the parquet 
> column name, then the value is retrieved.
> 
> Also, if the Hive schema has columns and struct elements that do not match 
> with the Parquet schema, then it will return NULL values instead.
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
>  57ae7a9740d55b407cadfc8bc030593b29f90700 
>   ql/src/test/queries/clientpositive/parquet_schema_evolution.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_table_with_subschema.q 
> PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_schema_evolution.q.out 
> PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_table_with_subschema.q.out 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/32499/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

Re: Review Request 32499: HIVE-10086: Hive throws error when accessing Parquet file schema using field name match

Reply via email to