Elliot West created HIVE-11092:
----------------------------------

             Summary: First delta of an ORC ACID table contains incorrect schema
                 Key: HIVE-11092
                 URL: https://issues.apache.org/jira/browse/HIVE-11092
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: Elliot West
            Assignee: Elliot West
            Priority: Minor


I've been reading the ORC ACID data that backs transactional tables from a 
process external to Hive. Initially I tried to use 'schema on read', but found 
that the schema returned from the initial delta file is inconsistent with the 
schema returned from subsequent delta and base files. To reproduce the issue by 
example:

{code}
CREATE TABLE base_table ( id int, message string )
  PARTITIONED BY ( continent string, country string )
  CLUSTERED BY (id) INTO 1 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');
  
INSERT INTO TABLE base_table PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');

UPDATE base_table SET message = 'updated' WHERE id = 1;
{code}

Examining the raw data with the {{orcfiledump}} utility shows the following 
(output edited for brevity):
{code}
cd hive/warehouse/base_table/continent=Asia/country=India/

hive --orcfiledump delta_0000001_0000001/bucket_00000
Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<_col0:int,_col1:string>>

hive --orcfiledump delta_0000002_0000002/bucket_00000
Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<id:int,message:string>>
{code}

The row schema for the first delta, which resulted from the inserts, has its 
field names erased: {{row:struct<_col0:int,_col1:string>}}, whereas the delta 
for the update reports the correct schema: 
{{row:struct<id:int,message:string>}}. I have also checked this with my own 
reader code, so I am confident that {{FileDump}} is not at fault.
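
To make the mismatch concrete, here is a minimal sketch (not Hive or ORC code) that parses the two type strings shown above and compares the field names of the nested {{row}} struct. The helper names ({{split_top_level}}, {{struct_fields}}, {{row_field_names}}, {{has_placeholder_names}}) are hypothetical and exist only for this illustration; the parser assumes struct-only nesting, as in the dump output.

```python
import re

# Type strings copied from the orcfiledump output above.
FIRST_DELTA = ("struct<operation:int,originalTransaction:bigint,bucket:int,"
               "rowId:bigint,currentTransaction:bigint,"
               "row:struct<_col0:int,_col1:string>>")
SECOND_DELTA = ("struct<operation:int,originalTransaction:bigint,bucket:int,"
                "rowId:bigint,currentTransaction:bigint,"
                "row:struct<id:int,message:string>>")


def split_top_level(s):
    """Split on commas that are not nested inside <...>."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


def struct_fields(type_str):
    """Return ordered (name, type) pairs for a struct<...> type string."""
    if not (type_str.startswith("struct<") and type_str.endswith(">")):
        raise ValueError("not a struct type: " + type_str)
    inner = type_str[len("struct<"):-1]
    fields = []
    for part in split_top_level(inner):
        name, _, field_type = part.partition(":")
        fields.append((name, field_type))
    return fields


def row_field_names(acid_type):
    """Return the field names of the nested 'row' struct of an ACID file type."""
    row_type = dict(struct_fields(acid_type))["row"]
    return [name for name, _ in struct_fields(row_type)]


def has_placeholder_names(names):
    """True if every name looks like a positional placeholder such as _col0."""
    return all(re.fullmatch(r"_col\d+", n) for n in names)


if __name__ == "__main__":
    for label, type_str in (("delta 1", FIRST_DELTA), ("delta 2", SECOND_DELTA)):
        names = row_field_names(type_str)
        print(label, names, "placeholders" if has_placeholder_names(names) else "named")
```

Run against the two dumps, the first delta reports only {{_colN}} placeholders while the second reports the declared column names.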

I believe that the row field names, and hence the schema, should be consistent 
across all ORC files in the ACID data set. This would enable schema on read 
with field access by name (not index), which is currently not possible. 
Therefore I'd like to get this issue resolved.
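
As an illustration of why name-based access breaks today, the sketch below resolves a column by name and only succeeds on the first delta by falling back to position. It is not part of Hive: {{resolve_index}} and the {{declared_names}} argument are hypothetical, and the fallback assumes that {{_colN}} corresponds to the N-th declared table column, which the dumps above suggest but which I have not verified in general.

```python
import re


def resolve_index(file_names, declared_names, wanted):
    """Return the position of column `wanted` within a row struct.

    file_names:     row field names as stored in the ORC file
    declared_names: column names from the table definition (hypothetical input,
                    supplied by the caller from outside the file)
    """
    # Normal case: the file carries real column names.
    if wanted in file_names:
        return file_names.index(wanted)

    # Fallback: the file carries only _colN placeholders, so map by position.
    # ASSUMPTION: _colN is the N-th column of the declared table schema.
    if all(re.fullmatch(r"_col\d+", n) for n in file_names) and wanted in declared_names:
        placeholder = "_col%d" % declared_names.index(wanted)
        if placeholder in file_names:
            return file_names.index(placeholder)

    raise KeyError(wanted)


if __name__ == "__main__":
    declared = ["id", "message"]
    # First delta (placeholder names) and second delta (real names).
    print(resolve_index(["_col0", "_col1"], declared, "message"))
    print(resolve_index(["id", "message"], declared, "message"))
```

With consistent field names in every file, the fallback branch and the external {{declared_names}} input would be unnecessary.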

I'm happy to work on this; however, after working through {{OrcRecordUpdater}}, 
{{FileSinkOperator}}, and the related tests, I've failed to reproduce or 
isolate the issue at a smaller scale. I'd be grateful for some suggestions on 
where to look next.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
