Are you using OrcInputFormat.getReader to get a reader? If so, it should take care of these anomalies for you and mask your need to worry about delta versus base files.

Alan.

Elliot West <mailto:tea...@gmail.com>
April 29, 2015 at 9:40
Hi,

I'm implementing a tap to read Hive ORC ACID data into Cascading jobs and I've hit a couple of issues for a particular scenario. The case I have is when data has been written into a transactional table and a compaction has not yet occurred. This can be recreated like so:

    CREATE TABLE test_table ( id int, message string )
      PARTITIONED BY ( continent string, country string )
      CLUSTERED BY (id) INTO 1 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ('transactional' = 'true');

    INSERT INTO TABLE test_table
    PARTITION (continent = 'Asia', country = 'India')
    VALUES (1, 'x'), (2, 'y'), (3, 'z');


This results in a dataset that contains only a delta file:

    
    warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000
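
A delta-only layout like the one above can be detected from the child directory names alone. The following is a minimal sketch, assuming only the `base_N` and `delta_N_N` naming seen in this message; `AcidLayout` and its methods are hypothetical helpers, not part of Hive:

```java
import java.util.List;
import java.util.regex.Pattern;

public class AcidLayout {
    // Naming patterns taken from the paths quoted in this thread (assumption:
    // no other suffixes are present on the directory names).
    private static final Pattern BASE = Pattern.compile("base_\\d+");
    private static final Pattern DELTA = Pattern.compile("delta_\\d+_\\d+");

    public static boolean isBase(String dirName) {
        return BASE.matcher(dirName).matches();
    }

    public static boolean isDelta(String dirName) {
        return DELTA.matcher(dirName).matches();
    }

    /** True when the partition holds at least one delta directory and no base. */
    public static boolean isDeltaOnly(List<String> childDirNames) {
        boolean sawDelta = false;
        for (String name : childDirNames) {
            if (isBase(name)) {
                return false; // a base exists, so this is not the delta-only case
            }
            if (isDelta(name)) {
                sawDelta = true;
            }
        }
        return sawDelta;
    }
}
```

The caller is expected to pass the names of the immediate children of the partition directory, however it lists them.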


I'm assuming that this scenario is valid - a user might insert new data into a table and want to read it back at a time prior to the first compaction. I can select the data back from this table in Hive with no problem. However, for a number of reasons I'm finding it rather tricky to do so programmatically. At this point I should mention that reading base files or base+deltas is trouble free. The issues I've encountered are as follows:

 1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
    ReaderOptions) fails if the directory specified by the path
    ('warehouse/test_table/continent=Asia/country=India' in this case)
    contains only a delta. Specifically it attempts to access
    'delta_0000060_0000060' as if it were a file and therefore fails.
    It appears to function correctly if the directory also contains a
    base. We use this method to extract the typeInfo from the ORCFile
    and build a mapping between the user's declared fields.
 2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
    inconsistent in that it returns the path of the base if present,
    otherwise the parent. This presents issues within Cascading (and I
    assume other frameworks) that expect the paths returned by splits
    to be at the same depth and for them to contain some kind of
    'part' file leaf. In my example the path returned is
    'warehouse/test_table/continent=Asia/country=India', if I had also
    had a base I'd have seen
    'warehouse/test_table/continent=Asia/country=India/base_0000006'.
 3. The footers of the delta files do not contain the true field names
    of the table. In my example I see '_col0:int,_col1:string' where
    I'd expect 'id:int,message:string'. A base file, if present,
    correctly declares the field names. We chose to access values by
    field name rather than position so that users of our reader do not
    need to declare the full schema to read partial data, however this
    behaviour trips this up.
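
One possible way around issue 3 is to substitute the table's declared column names for the `_colN` placeholders by position. This is only a sketch: it assumes the declared names are available from elsewhere (for example the metastore) and that the placeholders appear in declared order. `FieldNameResolver` and `resolve` are hypothetical names, not part of Hive:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class FieldNameResolver {
    // Placeholder names as reported for the delta footer, e.g. "_col0".
    private static final Pattern PLACEHOLDER = Pattern.compile("_col\\d+");

    /**
     * Maps each usable field name to its position in the row.
     *
     * @param footerNames field names read from the ORC footer, in order
     * @param tableNames  column names declared on the table, in order
     *                    (assumed to be supplied by the caller)
     */
    public static Map<String, Integer> resolve(List<String> footerNames,
                                               List<String> tableNames) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (int i = 0; i < footerNames.size(); i++) {
            String name = footerNames.get(i);
            // Swap in the declared name only when the footer carries a placeholder
            // and a declared name exists for this position.
            if (PLACEHOLDER.matcher(name).matches() && i < tableNames.size()) {
                name = tableNames.get(i);
            }
            positions.put(name, i);
        }
        return positions;
    }
}
```

With the example table, footer names `_col0, _col1` and declared names `id, message` give a lookup from `id` and `message` to positions 0 and 1, while a base file footer that already carries the real names passes through unchanged.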

I have (horrifically :) worked around issues 1 and 2 in my own code and have some ideas to circumvent 3 but I wanted to get a feeling as to whether I'm going against the tide and if my life might be easier if I approached this another way.

Thanks - Elliot.

