I've been looking at the structure of the ORC files that back transactional
tables in Hive. After a compaction I was surprised to find that the base
file structure is identical to the delta file structure:

  struct<
    operation:int,
    originalTransaction:bigint,
    bucket:int,
    rowId:bigint,
    currentTransaction:bigint,
    row:struct<
      // row fields
    >
  >
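
For reference, this is roughly how I have been inspecting the files; a
minimal sketch assuming the hive-exec ORC Reader API (the exact
createReader signature varies a little between Hive versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.io.orc.OrcFile;
  import org.apache.hadoop.hive.ql.io.orc.Reader;

  public class DumpAcidSchema {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Path to a bucket file under a base_* or delta_* directory.
      Path file = new Path(args[0]);
      Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
      // Prints the struct shown above for both base and delta files.
      System.out.println(reader.getObjectInspector().getTypeName());
      System.out.println("rows: " + reader.getNumberOfRows());
    }
  }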

This raises a few questions:

   - How should I interpret the operation and originalTransaction values in
   these compacted rows?
   - Are the values in the operation and originalTransaction fields
   required for the application of later deltas?
   - Does this structure in any way inhibit the ability to perform partial
   reads of the row data (i.e. reading only specific columns)?
   - How does this structure relate to the RecordIdentifier class, which
   contains only a subset of the metadata fields, and to
   AcidInputFormat.Options.recordIdColumn(), which seems to imply a metadata
   column that sits alongside the row columns rather than the nested
   structure we see in practice? (A rough sketch of what I mean follows
   this list.)
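
The kind of mismatch I mean in that last point, sketched in code
(RecordIdentifier here is org.apache.hadoop.hive.ql.io.RecordIdentifier;
the constructor and accessor names may differ across Hive versions):

  import org.apache.hadoop.hive.ql.io.RecordIdentifier;

  public class RecordIdentifierSketch {
    // RecordIdentifier appears to carry only a subset of the metadata above:
    // originalTransaction, bucket and rowId, but not operation or
    // currentTransaction, and recordIdColumn() suggests it surfaces as a flat
    // column next to the row columns rather than the nested struct on disk.
    public static RecordIdentifier fromAcidMetadata(
        long originalTransaction, int bucket, long rowId) {
      return new RecordIdentifier(originalTransaction, bucket, rowId);
    }
  }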

I suppose I might find the answers to some of these myself by simply
reading the data with the appropriate input format, which leads me to my
final question: is there already an input format available that will
seamlessly and transparently apply any deltas on read (for consuming the
data in an M/R job, for example)?
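
To make that last question concrete, this is roughly the kind of job
driver I have in mind. I'm assuming hive-exec's OrcInputFormat and
OrcStruct with the old mapred API here; whether this (or some ACID-aware
alternative to it) really reconciles the base with the deltas on read is
exactly what I don't know:

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
  import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class ReadAcidTable {

    // Emits each (hopefully merged) row as text, just to see what the reader returns.
    public static class PrintMapper extends MapReduceBase
        implements Mapper<NullWritable, OrcStruct, NullWritable, Text> {
      @Override
      public void map(NullWritable key, OrcStruct value,
          OutputCollector<NullWritable, Text> out, Reporter reporter) throws IOException {
        out.collect(NullWritable.get(), new Text(value.toString()));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(ReadAcidTable.class);
      job.setJobName("read-acid-orc");
      // The table/partition directory containing the base_* and delta_* subdirectories.
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setInputFormat(OrcInputFormat.class);
      job.setMapperClass(PrintMapper.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      job.setNumReduceTasks(0);
      JobClient.runJob(job);
    }
  }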

Apologies for so many questions.

Thanks - Elliot.
