I've been looking at the structure of the ORC files that back transactional tables in Hive. After a compaction, I was surprised to find that the base file structure is identical to the delta structure:
struct<
  operation:int,
  originalTransaction:bigint,
  bucket:int,
  rowId:bigint,
  currentTransaction:bigint,
  row:struct<
    // row fields
  >
>

This raises a few questions:

- How should I interpret the operation and originalTransaction values in these compacted rows?
- Are the values in the operation and originalTransaction fields required for the application of later deltas?
- Does this structure in any way inhibit the ability to perform partial reads of the row data (i.e. of specific columns)?
- How does this structure relate to the RecordIdentifier class, which contains only a subset of the metadata fields, and to AcidInputFormat.Options.recordIdColumn(), which seems to imply a metadata column alongside the row columns rather than the nested structure that we see in practice?

I suppose that I might find the answers to some of these myself by simply reading in the data with the appropriate input format, which leads me to my final question: is there already an input format available that will seamlessly and transparently apply any deltas on read (for consuming the data in an M/R job, for example)?

Apologies for so many questions.

Thanks - Elliot.
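P.S. To make that last question more concrete, something like the sketch below is what I had in mind for dumping the raw contents of a single base or delta bucket file. It's untested and assumes the Hive 0.14-era ORC reader API (OrcFile.createReader with ReaderOptions); the class name and the path are just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;

public class DumpAcidOrcFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A base or delta bucket file, e.g. .../my_table/base_0000005/bucket_00000
    Path path = new Path(args[0]);

    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));

    // Print the file's schema; for both base and delta files I'd expect the
    // wrapper struct<operation, originalTransaction, bucket, rowId,
    // currentTransaction, row<...>> described above.
    System.out.println(reader.getObjectInspector().getTypeName());

    RecordReader rows = reader.rows();
    Object row = null;
    while (rows.hasNext()) {
      row = rows.next(row);
      // Each value is an OrcStruct; its toString() shows the five metadata
      // fields followed by the nested user row.
      System.out.println(row);
    }
    rows.close();
  }
}

Of course this just reads one file verbatim; what I'm really after is an input format that merges the base with any deltas for me.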