[ https://issues.apache.org/jira/browse/HIVE-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Garima Dosi updated HIVE-15352:
-------------------------------
    Attachment: Hive MVCC - Requirement & Design.pdf

> MVCC (Multi Versioned Concurrency Control) in Hive
> --------------------------------------------------
>
>                 Key: HIVE-15352
>                 URL: https://issues.apache.org/jira/browse/HIVE-15352
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Garima Dosi
>         Attachments: Hive MVCC - Requirement & Design.pdf
>
>
> Use Case
> While building solutions for various applications, we see that there is
> at times a need to provide multi-version concurrency support for certain
> datasets. The requirement for multi-versioned concurrency is mainly due to
> two reasons:
> • Simultaneous querying and loading of tables or datasets, which requires
> maintaining versions for reading and writing (locking is not the right
> option here)
> • Maintaining a history of table/dataset loads, up to some extent
> Both of these requirements are seen in data management systems (warehouses
> etc.).
> What happens without MVCC in Hive?
> In cases where MVCC had to be done, a design similar to this one,
> https://dzone.com/articles/zookeeper-a-real-world-example-of-how-to-use-it
> was followed to make it work. ZooKeeper was used to maintain versions and
> provide MVCC support. However, this design poses a limitation if a normal
> user would like to query a Hive table, because the user will not be aware
> of the current version to be queried. The additional layer needed to match
> versions in ZooKeeper with the dataset to be queried introduces overhead
> for normal users; hence the request to make this feature available in Hive.
> Hive Design for Support of MVCC
> The Hive design for MVCC support could be as described below (it would
> broadly follow the article mentioned in the previous section):
> 1. The first thing should be the ability for the user to specify that this
> is an MVCC table.
> So, a DDL something like this:
> create table <table_name> ( <column_specs>) MULTI_VERSIONED ON [sequence, time]
> Internally this DDL can be translated to a table partitioned either on a
> sequence number (auto-generated by Hive) or on a timestamp. The metastore
> would keep this information.
> 2. DMLs related to inserting or loading data into the table would remain
> the same for an end user. However, internally Hive would automatically
> detect that a table is a multi-versioned table and write the new data to a
> new partition with a new version of the dataset. The Hive Metastore would
> also be updated with the current version.
> 3. DMLs related to querying data from the table would remain the same for a
> user. However, internally Hive would use the latest version for queries.
> The latest version is always stored in the metastore.
> Management of obsolete versions
> The obsolete versions can be deleted based on either of the following:
> 1. A setting which simply says to delete any version that is older than a
> threshold and is not active, OR
> 2. Tracking the count of queries running on older versions and deleting the
> ones which are not the latest and are not being used by any query. This
> would require some sort of background thread monitoring the table for
> obsolete versions. As shown in the article mentioned above, this would also
> require incrementing a version's query count whenever that version is
> queried and decrementing it once the query is done.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
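
The bookkeeping proposed above (writers publish a new version, readers use the latest one, and a background task drops versions that are neither the latest nor in use) can be sketched roughly as follows. This is only an in-memory illustration of the idea, not Hive, metastore, or ZooKeeper code. The class `VersionRegistry` and its methods `publish`, `reading`, and `collect_obsolete` are hypothetical names invented for this sketch, and it assumes versions are sequence numbers starting at 1.

```python
import threading
from contextlib import contextmanager


class VersionRegistry:
    """Hypothetical in-memory stand-in for the per-table version bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = 0      # 0 means no version has been published yet
        self._readers = {}    # version -> number of queries currently using it

    def publish(self):
        """A load finished writing a new partition: make it the latest version."""
        with self._lock:
            self._latest += 1
            self._readers[self._latest] = 0
            return self._latest

    @contextmanager
    def reading(self):
        """Pin the latest version for one query: increment, then decrement."""
        with self._lock:
            if self._latest == 0:
                raise LookupError("no version has been published yet")
            version = self._latest
            self._readers[version] += 1
        try:
            yield version
        finally:
            with self._lock:
                self._readers[version] -= 1

    def collect_obsolete(self):
        """Forget versions that are not the latest and have no running query."""
        with self._lock:
            obsolete = sorted(
                v for v, count in self._readers.items()
                if v != self._latest and count == 0
            )
            for v in obsolete:
                del self._readers[v]
            return obsolete


registry = VersionRegistry()
registry.publish()                          # version 1 is loaded
with registry.reading() as v:               # a query pins version 1
    registry.publish()                      # version 2 is loaded meanwhile
    print(v, registry.collect_obsolete())   # version 1 is still in use
print(registry.collect_obsolete())          # version 1 can now be dropped
```

In a real implementation the cleanup call would run on the background thread mentioned in option 2, and dropping a version would also mean dropping the corresponding partition. The threshold-based rule in option 1 would need a creation timestamp per version, which this sketch leaves out.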