Re: Review Request 32499: HIVE-10086: Hive throws error when accessing Parquet file schema using field name match

Sergio Pena Thu, 26 Mar 2015 13:22:05 -0700


> On March 26, 2015, 6:17 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java,
> >  line 214
> > <https://reviews.apache.org/r/32499/diff/1/?file=906071#file906071line214>
> >
> >     This gets the columns without changing the order, and the selected 
> > columns are the first N where N is the size of the list of names. So the 
> > only effect of this line is to shorten the schema to just what is defined 
> > in the table? In that case is it necessary to do this or can we just pass 
> > the table schema to the projection call later? Assuming the projected ids 
> > are always `< columnNamesList.size()` then it should do the same thing.


As I understand it, the parquet.column.index.access variable is used when the 
table column names do not match the Parquet file schema. In that case, users 
rely on index access to read columns by position instead of by name. This is 
different from the column order issue.

See the parquet_columnar.q test for an example of how it is used.
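
To illustrate the difference between index access and name access, here is a 
minimal sketch. It is not the code from the patch: the class 
ColumnAccessSketch, the method resolve, and the column names are hypothetical, 
and plain Java lists stand in for the Hive and Parquet schemas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of DataWritableReadSupport.
class ColumnAccessSketch {
    // Resolve each table column to a column of the Parquet file.
    // indexAccess == true : table column i maps to file column i (names ignored).
    // indexAccess == false: table column maps to the file column with the same name.
    // A null entry means no file column was found, so the reader would return NULL.
    static List<String> resolve(List<String> tableCols, List<String> fileCols,
                                boolean indexAccess) {
        List<String> resolved = new ArrayList<>();
        for (int i = 0; i < tableCols.size(); i++) {
            if (indexAccess) {
                resolved.add(i < fileCols.size() ? fileCols.get(i) : null);
            } else {
                String name = tableCols.get(i);
                resolved.add(fileCols.contains(name) ? name : null);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> fileCols = Arrays.asList("id", "name", "price");

        // Table columns were named differently from the file columns.
        List<String> renamed = Arrays.asList("c1", "c2");
        System.out.println(resolve(renamed, fileCols, true));   // prints [id, name]
        System.out.println(resolve(renamed, fileCols, false));  // prints [null, null]

        // Table columns have matching names but a different order.
        List<String> reordered = Arrays.asList("price", "id");
        System.out.println(resolve(reordered, fileCols, false)); // prints [price, id]
    }
}
```

With index access, renamed columns still resolve because only the position 
matters; with name access, reordered columns resolve because only the name 
matters.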


> On March 26, 2015, 6:17 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java,
> >  line 222
> > <https://reviews.apache.org/r/32499/diff/1/?file=906071#file906071line222>
> >
> >     This isn't a blocker, but I find it odd that the "HIVE_TABLE_SCHEMA" 
> > isn't a Hive schema. It's a Parquet schema. It might be too late to rename 
> > the constant's value, but renaming the variable might help readability.

Thanks. I renamed the variable, since it is only used when checking the table 
schema in DataWritableRecordConverter.java.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32499/#review77918
-----------------------------------------------------------


On March 25, 2015, 10:42 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32499/
> -----------------------------------------------------------
> 
> (Updated March 25, 2015, 10:42 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-10086
>     https://issues.apache.org/jira/browse/HIVE-10086
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Attached is the patch that handles schemas that do not match between Parquet 
> and Hive.
> 
> In this case, Parquet data is accessed by name matching. The table columns 
> may be in a different order than in the Parquet schema, but if a column name 
> matches a Parquet column name, then the value is retrieved.
> 
> Also, if the Hive schema has columns or struct elements that do not match 
> the Parquet schema, then NULL values are returned for them instead.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 57ae7a9740d55b407cadfc8bc030593b29f90700 
>   ql/src/test/queries/clientpositive/parquet_schema_evolution.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_table_with_subschema.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_schema_evolution.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_table_with_subschema.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/32499/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>
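
The name matching described in the quoted review request, including NULL for 
unmatched columns and struct elements, can be sketched roughly as follows. 
This is an assumption-laden illustration, not the patch itself: the class 
NameProjectionSketch, the method project, and the field names are 
hypothetical, and nested maps stand in for Hive and Parquet schemas and 
records.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the code under review.
class NameProjectionSketch {
    // hiveSchema: field name -> null for a primitive, or a nested map for a struct.
    // fileRecord: field name -> value read from the file (nested map for a struct).
    // The result follows the Hive schema order; unmatched fields become null.
    @SuppressWarnings("unchecked")
    static Map<String, Object> project(Map<String, Object> hiveSchema,
                                       Map<String, Object> fileRecord) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : hiveSchema.entrySet()) {
            // get() returns null when the file has no field with this name.
            Object value = fileRecord.get(field.getKey());
            if (field.getValue() instanceof Map) {
                // Hive expects a struct: match its elements by name as well,
                // or use null if the file does not hold a struct here.
                value = (value instanceof Map)
                        ? project((Map<String, Object>) field.getValue(),
                                  (Map<String, Object>) value)
                        : null;
            }
            row.put(field.getKey(), value);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> addressSchema = new LinkedHashMap<>();
        addressSchema.put("city", null);
        addressSchema.put("zip", null);      // not present in the file
        Map<String, Object> hiveSchema = new LinkedHashMap<>();
        hiveSchema.put("price", null);       // different order than the file
        hiveSchema.put("id", null);
        hiveSchema.put("address", addressSchema);
        hiveSchema.put("comment", null);     // not present in the file

        Map<String, Object> address = new LinkedHashMap<>();
        address.put("street", "Main");       // not requested by the table
        address.put("city", "Austin");
        Map<String, Object> fileRecord = new LinkedHashMap<>();
        fileRecord.put("id", 1);
        fileRecord.put("price", 9.5);
        fileRecord.put("address", address);

        System.out.println(project(hiveSchema, fileRecord));
        // prints {price=9.5, id=1, address={city=Austin, zip=null}, comment=null}
    }
}
```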
