[ https://issues.apache.org/jira/browse/HIVE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825611#comment-13825611 ]
Edward Capriolo commented on HIVE-5317:
---------------------------------------

I have two fundamental problems with this concept.

{quote}
The only requirement is that the file format must be able to support a rowid. With things like text and sequence file this can be done via a byte offset.
{quote}

This is a good reason not to do this. Features that only work for some formats create fragmentation. What about formats that do not have a row id? What if the user is already using the key for something else, like data?

{quote}
Once an hour a log of transactions is exported from a RDBS and the fact tables need to be updated (up to 1m rows) to reflect the new data. The transactions are a combination of inserts, updates, and deletes. The table is partitioned and bucketed.
{quote}

What this ticket describes seems like a bad use case for Hive. Why would the user not simply create a new table partitioned by hour? What is the need to transactionally update a table in place? It seems like the better solution would be for the user to log these updates themselves and then export the table with a tool like Sqoop periodically.

I see this as a really complicated piece of work for a narrow use case, and I have a very difficult time believing that adding transactions to Hive to support this is the right answer.

> Implement insert, update, and delete in Hive with full ACID support
> -------------------------------------------------------------------
>
>                 Key: HIVE-5317
>                 URL: https://issues.apache.org/jira/browse/HIVE-5317
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: InsertUpdatesinHive.pdf
>
>
> Many customers want to be able to insert, update and delete rows from Hive
> tables with full ACID support. The use cases are varied, but the form of the
> queries that should be supported are:
> * INSERT INTO tbl SELECT …
> * INSERT INTO tbl VALUES ...
> * UPDATE tbl SET … WHERE …
> * DELETE FROM tbl WHERE …
> * MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN
> ...
> * SET TRANSACTION LEVEL …
> * BEGIN/END TRANSACTION
> Use Cases
> * Once an hour, a set of inserts and updates (up to 500k rows) for various
> dimension tables (eg. customer, inventory, stores) needs to be processed. The
> dimension tables have primary keys and are typically bucketed and sorted on
> those keys.
> * Once a day a small set (up to 100k rows) of records need to be deleted for
> regulatory compliance.
> * Once an hour a log of transactions is exported from a RDBS and the fact
> tables need to be updated (up to 1m rows) to reflect the new data. The
> transactions are a combination of inserts, updates, and deletes. The table is
> partitioned and bucketed.

--
This message was sent by Atlassian JIRA
(v6.1#6144)