Hi,

I'd like to ascertain whether it might be possible to add 'update' and
'delete' operations to the hive-hcatalog-streaming API. I've been looking at
the API with interest for the last week as it appears to have the potential
to help with some general data processing patterns that are prevalent where
I work. In short, we continuously load large amounts of data into Hadoop,
partitioned by some time interval - usually hour, day, or month, depending
on the data size. However, the records in this data can change: we often
receive new information that mutates part of an existing record already
stored in a partition in HDFS. Typically the number of mutations is very
small compared to the number of records in each partition.

To handle this currently, we re-read and re-write all partitions that could
potentially be affected by new data. In practice, a single hour's worth of
new data can require the reading and writing of a month's worth of
partitions. By storing the data in a transactional Hive table I believe
that we could instead issue updates and deletes for only the affected rows.
Although we do use Hive for analytics on this data, much of the processing
that generates and consumes the data is performed with Cascading. I'd
therefore like to be able to read and write the data via an API which we'd
aim to integrate into a Cascading Tap of some description. Our Cascading
processes could determine the new, updated, and deleted records and then
use the API to stream these changes to the transactional Hive table.
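
For context, the existing insert-only path that we are building on looks
roughly like this (a minimal sketch; the metastore URI, database, table,
partition, and column names are placeholders):

  import java.util.Arrays;

  import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
  import org.apache.hive.hcatalog.streaming.HiveEndPoint;
  import org.apache.hive.hcatalog.streaming.StreamingConnection;
  import org.apache.hive.hcatalog.streaming.TransactionBatch;

  public class StreamingInsertExample {
    public static void main(String[] args) throws Exception {
      // Connect to one partition of a transactional (ORC, bucketed) table.
      HiveEndPoint endPoint = new HiveEndPoint("thrift://metastore:9083",
          "my_db", "my_table", Arrays.asList("2015-03-01"));
      StreamingConnection connection = endPoint.newConnection(true);

      // Writer that maps delimited records onto the table's columns.
      DelimitedInputWriter writer = new DelimitedInputWriter(
          new String[] {"id", "message"}, ",", endPoint);

      TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
      txnBatch.beginNextTransaction();
      txnBatch.write("1,hello".getBytes()); // inserts only; no update/delete
      txnBatch.commit();
      txnBatch.close();
      connection.close();
    }
  }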

We have most of this working in a proof of concept, but because
hive-hcatalog-streaming does not expose the delete/update methods of the
OrcRecordUpdater, we've had to hack together something unpleasant based on
the original API.
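
To make the discussion concrete, the kind of extension we have in mind might
look something like the sketch below. To be clear, this is purely
illustrative: the MutationBatch name and method signatures are ours, not an
existing Hive interface, and the details of how a record identifies the
original row would need careful thought:

  import org.apache.hive.hcatalog.streaming.StreamingException;
  import org.apache.hive.hcatalog.streaming.TransactionBatch;

  /**
   * Hypothetical extension of TransactionBatch that surfaces the mutation
   * operations that OrcRecordUpdater already supports internally. The name
   * and signatures here are illustrative only.
   */
  public interface MutationBatch extends TransactionBatch {

    /**
     * Replaces an existing row. The record must identify the original row,
     * e.g. by carrying its ROW__ID (original transaction, bucket, row id).
     */
    void update(byte[] record) throws StreamingException, InterruptedException;

    /** Deletes an existing row, identified in the same way. */
    void delete(byte[] record) throws StreamingException, InterruptedException;
  }

An alternative would be to pass the row identifier explicitly in the method
signatures rather than embedding it in the record; either way, the ACID
delta files need the original transaction, bucket, and row id to locate the
affected row.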

As a first step I'd like to check whether there is any appetite for adding
such functionality to the API, or if this goes against the original
motivations of the project. If this suggestion sounds reasonable then I'd be
keen to help move it forward.

Thanks - Elliot.
