[ 
https://issues.apache.org/jira/browse/HIVE-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978476#comment-14978476
 ] 

Elliot West commented on HIVE-11092:
------------------------------------

Fixed by HIVE-4243 apparently. I'll confirm and close.

> First delta of an ORC ACID table contains non-descriptive schema
> ----------------------------------------------------------------
>
>                 Key: HIVE-11092
>                 URL: https://issues.apache.org/jira/browse/HIVE-11092
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Elliot West
>            Assignee: Elliot West
>            Priority: Minor
>              Labels: orc, orcfile, transaction, transactions
>
> I've been reading ORC ACID data that backs transactional tables from a 
> process external to Hive. Initially I tried to use 'schema on read' but found 
> inconsistencies between the schema returned from the initial delta file and 
> the schema returned from subsequent delta and base files. To reproduce the 
> issue by example:
> {code}
> CREATE TABLE base_table ( id int, message string )
>   PARTITIONED BY ( continent string, country string )
>   CLUSTERED BY (id) INTO 1 BUCKETS
>   STORED AS ORC
>   TBLPROPERTIES ('transactional' = 'true');
>   
> INSERT INTO TABLE base_table PARTITION (continent = 'Asia', country = 'India')
> VALUES (1, 'x'), (2, 'y'), (3, 'z');
> UPDATE base_table SET message = 'updated' WHERE id = 1;
> {code}
> Examining the raw data with the {{orcfiledump}} utility gives the following 
> (output edited for brevity):
> {code}
> cd hive/warehouse/base_table/continent=Asia/country=India/
> hive --orcfiledump delta_0000001_0000001/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<_col0:int,_col1:string>>
>
> hive --orcfiledump delta_0000002_0000002/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<id:int,message:string>>
> {code}
> The row schema for the first delta, which resulted from the inserts, has its 
> field names erased: {{row:struct<_col0:int,_col1:string>}}, whereas the delta 
> for the update reports the correct schema: 
> {{row:struct<id:int,message:string>}}. I have also checked this with my own 
> reader code, so I am confident that {{FileDump}} is not at fault.
> I believe that the row field names, and hence schema, should be consistent 
> across all ORC files in the ACID data set. This will enable schema on read 
> with field access by name (not index), which is currently not possible. 
> Therefore I'd like to get this issue resolved.
> I'm happy to work on this; however, after working through 
> {{OrcRecordUpdater}}, {{FileSinkOperator}}, and related tests, I've failed to 
> reproduce or isolate the issue at a smaller scale. I'd be grateful for some 
> suggestions on where to look next.
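
For readers hitting the same inconsistency from an external process, the check described in the report can be sketched in a few lines of Python. This is only an illustration built on the two type strings quoted in the description, not anything from Hive or the ORC libraries: the helper names (`row_fields`, `has_erased_names`, `restore_names`) are hypothetical, the real column names are assumed to come from elsewhere (for example the metastore or a later delta), and it assumes that `_colN` placeholders follow the table's declared column order.

```python
import re

# Placeholder names such as _col0, _col1 as seen in the first delta.
PLACEHOLDER = re.compile(r"^_col(\d+)$")


def split_top_level(body):
    """Split 'a:int,b:struct<c:int>' on commas that are not nested inside <>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts


def struct_fields(type_str):
    """Return [(name, type), ...] for a 'struct<...>' type string."""
    if not (type_str.startswith("struct<") and type_str.endswith(">")):
        raise ValueError("not a struct type: " + type_str)
    inner = type_str[len("struct<"):-1]
    return [tuple(part.split(":", 1)) for part in split_top_level(inner)]


def row_fields(acid_type):
    """Return the fields of the nested 'row' struct of an ACID event type."""
    return struct_fields(dict(struct_fields(acid_type))["row"])


def has_erased_names(fields):
    """True when every field name is a positional placeholder (_colN)."""
    return bool(fields) and all(PLACEHOLDER.match(name) for name, _ in fields)


def restore_names(fields, column_names):
    """Replace _colN with column_names[N]; assumes N is the column's position."""
    restored = []
    for name, type_name in fields:
        match = PLACEHOLDER.match(name)
        if match:
            name = column_names[int(match.group(1))]
        restored.append((name, type_name))
    return restored


# The two type strings reported by orcfiledump in the description above.
FIRST_DELTA = ("struct<operation:int,originalTransaction:bigint,bucket:int,"
               "rowId:bigint,currentTransaction:bigint,"
               "row:struct<_col0:int,_col1:string>>")
SECOND_DELTA = ("struct<operation:int,originalTransaction:bigint,bucket:int,"
                "rowId:bigint,currentTransaction:bigint,"
                "row:struct<id:int,message:string>>")

first = row_fields(FIRST_DELTA)
second = row_fields(SECOND_DELTA)
if has_erased_names(first):
    # Column names are supplied by the caller here, not read from the file.
    first = restore_names(first, ["id", "message"])
```

With the names restored this way, both deltas expose the same `row` fields, so a reader can access columns by name. This only works around the symptom on the reading side; it does not change what is written to the first delta.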



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
