[ 
https://issues.apache.org/jira/browse/HIVE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772192#comment-13772192
 ] 

Eric Hanson commented on HIVE-5317:
-----------------------------------

Overall this looks like a workable approach give the use cases described 
(mostly coarse grained updates with a low transaction rate), and it has the 
benefit that it doesn't take a dependency on another large piece of software 
like an update-aware DBMS or NoSQL store.

Regarding use cases, it appears that this design won't be able to have fast 
performance for fine-grained inserts. E.g. there might be scenarios where you 
want to insert one row into a fact table every 10 milliseconds in a separate 
transaction and have the rows immediately visible to readers. Are you willing 
to forgo that use case? It sounds like yes. This may be reasonable. If you want 
to handle it then a different design for the delta insert file information is 
probably needed, i.e. a store that's optimized for short write transactions.

I didn't see any obvious problem, due to the versioned scans, but is this 
design safe from the Halloween problem? That's the problem where an update scan 
sees its own updates again, causing an infinite loop or incorrect update. An 
argument that the design is safe from this would be good.

You mention that you will have one type of delta file that encodes updates 
directly, for sorted files. Is this really necessary, or can you make updates 
illegal for sorted files? If updates can always be modelled as insert plus 
deleted, that simplifies things.

How do you ensure that the delta files are fully written (committed) to the 
storage system before the metastore treats the transaction that created the 
delta file as committed?

It's not completely clear why you need exactly the transaction ID information 
specified in the delta file names. E.g. would just the transaction ID (start 
timestamp) be enough? A precise specification of how they are used would be 
useful.

Explicitly explaining what happens when a transaction aborts and how its delta 
files get ignored and then cleaned up would be useful.

Is there any issue with correctness of task retry in the presence of updates if 
a task fails? It appears that it is safe due to the snapshot isolation. 
Explicitly addressing this in the specification would be good.


                
> Implement insert, update, and delete in Hive with full ACID support
> -------------------------------------------------------------------
>
>                 Key: HIVE-5317
>                 URL: https://issues.apache.org/jira/browse/HIVE-5317
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: InsertUpdatesinHive.pdf
>
>
> Many customers want to be able to insert, update and delete rows from Hive 
> tables with full ACID support. The use cases are varied, but the form of the 
> queries that should be supported are:
> * INSERT INTO tbl SELECT …
> * INSERT INTO tbl VALUES ...
> * UPDATE tbl SET … WHERE …
> * DELETE FROM tbl WHERE …
> * MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN 
> ...
> * SET TRANSACTION LEVEL …
> * BEGIN/END TRANSACTION
> Use Cases
> * Once an hour, a set of inserts and updates (up to 500k rows) for various 
> dimension tables (eg. customer, inventory, stores) needs to be processed. The 
> dimension tables have primary keys and are typically bucketed and sorted on 
> those keys.
> * Once a day a small set (up to 100k rows) of records need to be deleted for 
> regulatory compliance.
> * Once an hour a log of transactions is exported from a RDBS and the fact 
> tables need to be updated (up to 1m rows)  to reflect the new data. The 
> transactions are a combination of inserts, updates, and deletes. The table is 
> partitioned and bucketed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to