Hi Alan,

Regarding the streaming changes, I've raised an issue and submitted patches
here: https://issues.apache.org/jira/browse/HIVE-10165
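
For anyone following the thread who hasn't looked at the JIRA yet, the rough
shape of what I'm proposing is a batch abstraction that exposes update and
delete alongside the existing insert-style write. The sketch below is purely
illustrative - the interface and method names are mine for this mail and not
necessarily what the patches use:

    // Illustrative sketch only: an extended batch that accepts mutations as
    // well as inserts. This is not the actual HIVE-10165 patch.
    public interface MutatingTransactionBatch {

      /** Insert a new record, as the existing streaming API already allows. */
      void write(byte[] record);

      /** Update an existing record; the encoded row must identify the
       *  original row (its ROW__ID) so that no table scan is needed. */
      void update(byte[] updatedRecord);

      /** Delete an existing record, again identified by its ROW__ID. */
      void delete(byte[] deletedRecord);
    }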
Thanks - Elliot.

On 26 March 2015 at 23:20, Alan Gates <alanfga...@gmail.com> wrote:

> Elliot West <tea...@gmail.com>
> March 26, 2015 at 15:58
> Hi Alan,
>
> Yes, this is precisely our situation. The issues I'm having with the
> current API are that I cannot intercept the creation of the
> OrcRecordUpdater to set the recordIdColumn in the AcidOutputFormat.Options
> instance. Additionally, I cannot extend the TransactionBatch interface to
> expose further operations and instead have to encode the operation type in
> the row parameter and decode it in the RecordWriter implementation -
> possible but not very elegant.
>
> Yes, you'd have to make changes in the API itself to make this possible,
> which is fine. I'm happy to review patches for this.
>
> Regarding the merge functionality, is this a new suggested feature of
> Hive? Is there a JIRA I could track?
>
> Not yet, it's just discussions I've had with users like you who'd be
> interested in merge, though from the SQL side rather than streaming. I am
> not aware of anyone working on it at the moment.
>
> Alan.
>
> Thanks - Elliot.
>
> On Thursday, March 26, 2015, Alan Gates <alanfga...@gmail.com> wrote:
>
> Alan Gates <alanfga...@gmail.com>
> March 26, 2015 at 15:30
> Are you saying that when the records arrive you don't know updates from
> inserts and you're already doing processing to determine that? If so, this
> is exactly the case we'd like to hit with the merge functionality.
>
> If you're already scanning the existing ORC file and obtaining the unique
> identifier (which is the triple (rowid, txnid, bucketid)) and determining
> which records have changed then you could certainly change the streaming
> interface to enable passing the records through in bulk (along with
> operation markers to indicate insert/update/delete) and write those in a
> delta file in one pass.
>
> Alan.
>
> Elliot West <tea...@gmail.com>
> March 26, 2015 at 15:10
> Hi, thanks for your quick reply.
>
> I see your point, but in my case would I not have the required
> RecordIdentifiers available as I'm already reading the entire partition to
> determine which records have changed? Admittedly Hive will not reveal
> the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
> underlying ORCFile and therefore can be read by my Cascading process. In
> this scenario (where I have the relevant recIds to hand) why would it then
> be necessary to perform additional scans when issuing the deletes/updates?
>
> Apologies if I'm misunderstanding something.
>
> Thanks - Elliot.
>
> On Thursday, March 26, 2015, Alan Gates <alanfga...@gmail.com> wrote:
>
> Alan Gates <alanfga...@gmail.com>
> March 26, 2015 at 14:48
> The missing piece for adding update and delete to the streaming API is a
> primary key. Updates and deletes in SQL work by scanning the table or
> partition where the record resides. This is assumed to be ok since we are
> not supporting transactional workloads and thus update/deletes are assumed
> to be infrequent. But a need to scan for each update or delete will not
> perform adequately in the streaming case.
>
> I've had a few discussions with others recently who are thinking of adding
> merge like functionality, where you would upload all changes to a temp
> table and then in one scan/transaction apply those changes. This is a
> common way to handle these situations for data warehouses, and is much
> easier than adding a primary key concept to Hive.
>
> Alan.
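
(An aside to make the above concrete: since our process already reads the
entire partition, it can carry the (rowid, txnid, bucketid) triple alongside
each changed row and hand it back together with an operation marker. The class
below is a hypothetical sketch of such a change event, not an existing Hive or
Cascading type.)

    // Hypothetical sketch: a change event tagging an encoded row with an
    // operation marker and, for updates/deletes, the (rowid, txnid, bucketid)
    // triple that identifies the existing row in the ORC data.
    public class MutationEvent {

      public enum OperationType { INSERT, UPDATE, DELETE }

      private final OperationType operation;
      private final Long rowId;           // null for inserts
      private final Long transactionId;   // null for inserts
      private final Integer bucketId;     // null for inserts
      private final byte[] record;        // the encoded row itself

      public MutationEvent(OperationType operation, Long rowId,
                           Long transactionId, Integer bucketId, byte[] record) {
        this.operation = operation;
        this.rowId = rowId;
        this.transactionId = transactionId;
        this.bucketId = bucketId;
        this.record = record;
      }

      public OperationType getOperation() { return operation; }
      public Long getRowId()              { return rowId; }
      public Long getTransactionId()      { return transactionId; }
      public Integer getBucketId()        { return bucketId; }
      public byte[] getRecord()           { return record; }
    }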
> Elliot West <tea...@gmail.com>
> March 26, 2015 at 14:08
> Hi,
>
> I'd like to ascertain if it might be possible to add 'update' and 'delete'
> operations to the hive-hcatalog-streaming API. I've been looking at the API
> with interest for the last week as it appears to have the potential to help
> with some general data processing patterns that are prevalent where I work.
> Ultimately, we continuously load large amounts of data into Hadoop which is
> partitioned by some time interval - usually hour, day, or month depending
> on the data size. However, the records that reside in this data can change.
> We often receive some new information that mutates part of an existing
> record already stored in a partition in HDFS. Typically the number of
> mutations is very small compared to the number of records in each
> partition.
>
> To handle this currently we re-read and re-write all partitions that could
> potentially be affected by new data. In practice a single hour's worth of
> new data can require the reading and writing of 1 month's worth of
> partitions. By storing the data in a transactional Hive table I believe
> that we can instead issue updates and deletes for only the affected rows.
> Although we do use Hive for analytics on this data, much of the processing
> that generates and consumes the data is performed using Cascading.
> Therefore I'd like to be able to read and write the data via an API which
> we'd aim to integrate into a Cascading Tap of some description. Our
> Cascading processes could determine the new, updated, and deleted records
> and then use the API to stream these changes to the transactional Hive
> table.
>
> We have most of this working in a proof of concept, but as
> hive-hcatalog-streaming does not expose the delete/update methods of the
> OrcRecordUpdater we've had to hack together something unpleasant based on
> the original API.
>
> As a first step I'd like to check if there is any appetite for adding such
> functionality to the API, or if this goes against the original motivations
> of the project. If this suggestion sounds reasonable then I'd be keen to
> help move this forward.
>
> Thanks - Elliot.
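
(And, for completeness, the flow we have in the proof of concept looks roughly
like the sketch below: the Cascading side determines the changed rows and then
replays them against the writer. MutationSink, MutationEvent and
MutatingTransactionBatch are all hypothetical names used only in this mail,
not classes from hive-hcatalog-streaming.)

    // Purely illustrative: replaying detected changes against an extended
    // streaming writer, using the hypothetical types sketched earlier in
    // this thread. A fuller version would also thread the ROW__ID triple
    // through to the writer for updates and deletes.
    public class MutationSink {

      private final MutatingTransactionBatch batch;

      public MutationSink(MutatingTransactionBatch batch) {
        this.batch = batch;
      }

      /** Apply a stream of detected changes to the transactional table. */
      public void apply(Iterable<MutationEvent> changes) {
        for (MutationEvent event : changes) {
          switch (event.getOperation()) {
            case INSERT:
              batch.write(event.getRecord());
              break;
            case UPDATE:
              batch.update(event.getRecord());
              break;
            case DELETE:
              batch.delete(event.getRecord());
              break;
          }
        }
      }
    }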