[ https://issues.apache.org/jira/browse/HIVE-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978476#comment-14978476 ]
Elliot West commented on HIVE-11092:
------------------------------------

Fixed by HIVE-4243 apparently. I'll confirm and close.

> First delta of an ORC ACID table contains non-descriptive schema
> ----------------------------------------------------------------
>
>                 Key: HIVE-11092
>                 URL: https://issues.apache.org/jira/browse/HIVE-11092
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Elliot West
>            Assignee: Elliot West
>            Priority: Minor
>              Labels: orc, orcfile, transaction, transactions
>
> I've been reading ORC ACID data that backs transactional tables from a
> process external to Hive. Initially I tried to use 'schema on read' but found
> some inconsistencies between the schema returned from the initial delta file
> and the schema returned from subsequent delta and base files. To reproduce
> the issue by example:
> {code}
> CREATE TABLE base_table ( id int, message string )
>   PARTITIONED BY ( continent string, country string )
>   CLUSTERED BY (id) INTO 1 BUCKETS
>   STORED AS ORC
>   TBLPROPERTIES ('transactional' = 'true');
>
> INSERT INTO TABLE base_table PARTITION (continent = 'Asia', country = 'India')
>   VALUES (1, 'x'), (2, 'y'), (3, 'z');
> UPDATE base_table SET message = 'updated' WHERE id = 1;
> {code}
> Now examining the raw data with the {{orcfiledump}} utility (edited for
> brevity):
> {code}
> cd hive/warehouse/base_table/continent=Asia/country=India/
> hive --orcfiledump delta_0000001_0000001/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<_col0:int,_col1:string>>
>
> hive --orcfiledump delta_0000002_0000002/bucket_00000
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<id:int,message:string>>
> {code}
> The row schema for the first delta, which resulted from the inserts, has its
> field names erased: {{row:struct<_col0:int,_col1:string>}}, whereas the delta
> for the update reports the correct schema:
> {{row:struct<id:int,message:string>}}. I have also checked this with my own
> reader code, so I am confident that {{FileDump}} is not at fault.
> I believe that the row field names, and hence the schema, should be
> consistent across all ORC files in the ACID data set. This would enable
> schema on read with field access by name (not index), which is currently not
> possible. Therefore I'd like to get this issue resolved.
> I'm happy to work on this; however, after working through
> {{OrcRecordUpdater}} and {{FileSinkOperator}} and related tests, I've failed
> to reproduce or isolate the issue at a smaller scale. I'd be grateful for
> some suggestions on where to look next.
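
For anyone reading these files from outside Hive in the meantime, the mismatch can at least be detected from the type description alone. The following is a minimal Python sketch, not Hive or ORC code. The helper names ({{struct_fields}}, {{row_field_names}}, {{has_placeholder_names}}) are hypothetical, and it assumes the type string has the form printed by {{orcfiledump}} in the report above, with erased names always of the form {{_colN}}.

```python
import re

# Type descriptions copied from the orcfiledump output in the report.
FIRST_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,"
    "currentTransaction:bigint,row:struct<_col0:int,_col1:string>>"
)
UPDATE_DELTA = (
    "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,"
    "currentTransaction:bigint,row:struct<id:int,message:string>>"
)

# Assumption: erased field names always look like _col0, _col1, ...
PLACEHOLDER = re.compile(r"_col\d+")


def split_top_level(body):
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    if body[start:]:
        parts.append(body[start:])
    return parts


def struct_fields(type_str):
    """Return {field name: field type} for a 'struct<...>' description."""
    type_str = type_str.strip()
    if not (type_str.startswith("struct<") and type_str.endswith(">")):
        raise ValueError("not a struct type: %r" % type_str)
    fields = {}
    for part in split_top_level(type_str[len("struct<"):-1]):
        name, _, field_type = part.partition(":")
        fields[name] = field_type
    return fields


def row_field_names(acid_type):
    """Field names of the nested 'row' struct, in declaration order."""
    return list(struct_fields(struct_fields(acid_type)["row"]))


def has_placeholder_names(acid_type):
    """True if every row field name is a _colN placeholder."""
    names = row_field_names(acid_type)
    return bool(names) and all(PLACEHOLDER.fullmatch(n) for n in names)
```

A reader could call {{has_placeholder_names}} on each file's type description and fall back to access by index (or to names taken from a file that has them) whenever it returns true. This only works around the symptom; it does not make the files themselves consistent.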