[ 
https://issues.apache.org/jira/browse/HIVE-12285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982444#comment-14982444
 ] 

Elliot West commented on HIVE-12285:
------------------------------------

[~ekoifman], [~sushanth]: I very much appreciate your comprehensive comments. 
The historical context is particularly intriguing and is a great description of 
how features can evolve organically in unexpected directions. The fact that 
HCatalog didn't replace the metastore is unfortunate but I can understand how 
it occurred. However, the lineage of {{HCatClientHMSImpl}} is a remarkable 
example of an unintended side effect! Thank you for sharing. I really like the 
original vision; it's a shame it didn't come to fruition.

Regarding a top-level Hive API package: I do wonder whether the time has come 
when such an API is a necessity. For most Big Data developers, whether they 
are building systems using MR/Pig/Cascading/Spark/Flink/etc., Hive 
interoperability is almost always a requirement. Additionally, Hive contains 
solutions for some difficult problems shared by all of these frameworks such 
as: how to manage concurrent access to changing data, and how to efficiently 
mutate data. Currently I see frameworks integrating with Hive in a variety of 
ways: some use the {{Driver}}, {{IMetaStoreClient}}, or {{HCatClient}}; some 
use the HCat storage handlers; and others emulate these with bespoke code. 
While it is unfortunate that multiple frameworks 
may re-implement code to add a partition for example, it's a simple operation 
and presents little risk to the integrity of data in Hive. However, complex 
features such as locking and transaction management are a very different 
matter. I suspect that it is very hard to correctly implement clients that 
utilise these features. Additionally, it would seem that there is a far greater 
opportunity to unintentionally introduce inconsistencies into the system.
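To illustrate the contrast, adding a partition through {{HCatClient}} is 
essentially a one-liner. A rough sketch (the database, table, partition, 
location, and metastore URI below are all made up):
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatAddPartitionDesc;
import org.apache.hive.hcatalog.api.HCatClient;

public class AddPartitionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

    HCatClient client = HCatClient.create(conf);
    try {
      Map<String, String> partitionSpec = new HashMap<>();
      partitionSpec.put("date", "2015-10-28");

      // A single call adds the partition; no Hive lock is taken, so a
      // concurrent query could observe the table mid-change.
      client.addPartition(HCatAddPartitionDesc
          .create("my_database", "my_table",
              "hdfs:///warehouse/my_database.db/my_table/date=2015-10-28",
              partitionSpec)
          .build());
    } finally {
      client.close();
    }
  }
}
{code}
Correct locking, by contrast, involves building lock requests, polling for 
acquisition, heartbeating, and releasing, each of which is easy to get subtly 
wrong.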

As an API user the problem is not the lack of an implementation, but knowing 
which of the many implementations to invest in. Initially this might largely be 
solved by distilling the content in [~sushanth]'s comment to a 'Hive APIs' wiki 
page.

I've gone completely off the topic of this JIRA now, but in summary I'm taking 
from this that {{HCatClientHMSImpl}} would be a reasonable target for locking 
functionality for the following reasons:
* It is a popular interface, suggested for use by external callers.
* It may well serve as the basis for a top-level {{hive-api}}.




> Add locking to HCatClient
> -------------------------
>
>                 Key: HIVE-12285
>                 URL: https://issues.apache.org/jira/browse/HIVE-12285
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 2.0.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: concurrency, hcatalog, lock, locking, locks
>
> With the introduction of a concurrency model (HIVE-1293) Hive uses locks to 
> coordinate access and updates to both table data and metadata. Within the 
> Hive CLI such lock management is seamless. However, Hive provides additional 
> APIs that permit interaction with data repositories, namely the HCatalog 
> APIs. Currently, operations implemented by these APIs do not participate in 
> Hive's locking scheme. Furthermore, access to the locking mechanisms is not 
> exposed by these APIs (as it is by the Metastore Thrift API) and so users 
> are not able to explicitly interact with locks either. This has created a 
> less than ideal situation where users of the APIs have no choice but to 
> manipulate these data repositories outside the control of Hive's lock 
> management, potentially resulting in situations where data inconsistencies 
> can occur both for external processes using the API and for queries executing 
> within Hive.
> h3. Scope of work
> This ticket is concerned with sections of the HCatalog API that deal with DDL 
> type operations using the metastore, not with those whose purpose is to 
> read/write table data. A separate issue already exists for adding locking to 
> HCat readers and writers (HIVE-6207).
> h3. Proposed work
> The following work items would serve as a minimum deliverable that would 
> allow API users to work effectively with locks:
> * Comprehensively document on the wiki the locks required for various Hive 
> operations. At a minimum this should cover all operations exposed by 
> {{HCatClient}}. The [Locking design 
> document|https://cwiki.apache.org/confluence/display/Hive/Locking] can be 
> used as a starting point or perhaps updated.
> * Implement methods and types in the {{HCatClient}} API that allow users to 
> manipulate Hive locks. For the most part I'd expect these to delegate to the 
> following metastore API implementations (a rough usage sketch follows the 
> list):
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
> ** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
> ** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
> ** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
> ** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
> ** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
> ** {{org.apache.hadoop.hive.metastore.api.LockType}}
> ** {{org.apache.hadoop.hive.metastore.api.LockState}}
> ** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
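> To give a flavour of what's involved, here is a rough sketch of manual lock 
> management directly against {{IMetaStoreClient}} (the database/table names 
> and the naive polling loop are illustrative only):
> {code:java}
> import java.util.Collections;
> 
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.LockComponent;
> import org.apache.hadoop.hive.metastore.api.LockLevel;
> import org.apache.hadoop.hive.metastore.api.LockRequest;
> import org.apache.hadoop.hive.metastore.api.LockResponse;
> import org.apache.hadoop.hive.metastore.api.LockState;
> import org.apache.hadoop.hive.metastore.api.LockType;
> 
> public class ManualLockExample {
>   public static void main(String[] args) throws Exception {
>     IMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
> 
>     // Describe what we want to lock: a shared read lock on one table.
>     LockComponent component =
>         new LockComponent(LockType.SHARED_READ, LockLevel.TABLE, "my_database");
>     component.setTablename("my_table");
> 
>     LockRequest request = new LockRequest(
>         Collections.singletonList(component),
>         System.getProperty("user.name"), "client-host.example.com");
> 
>     // Request the lock; we may be queued behind existing lock holders.
>     LockResponse response = client.lock(request);
>     long lockId = response.getLockid();
>     while (response.getState() == LockState.WAITING) {
>       Thread.sleep(1000L);
>       response = client.checkLock(lockId);
>     }
> 
>     try {
>       // ... perform the metadata operation under the lock ...
>       // For long-running work, heartbeat periodically so the metastore
>       // does not time the lock out (lock-only usage, so txnId = 0).
>       client.heartbeat(0, lockId);
>     } finally {
>       client.unlock(lockId);
>     }
>   }
> }
> {code}
> The proposed {{HCatClient}} methods would wrap exactly this kind of sequence 
> so that callers don't have to reproduce it themselves.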
> h3. Additional proposals
> Explicit lock management should be fairly simple to add to {{HCatClient}}; 
> however, it puts the onus on the API user to understand and implement code 
> that uses locks correctly, and failure to do so may have undesirable 
> consequences. With a simpler user model, the operations exposed by the API 
> would automatically acquire and release the locks that they need. This might 
> work well for small numbers of operations, but perhaps not for large 
> sequences of invocations. (Do we need to worry about this though, as the API 
> methods usually accept batches?) Additionally, tasks such as heartbeat 
> management could be handled implicitly for long-running sets of operations. 
> With these concerns in mind it may also be beneficial to deliver some of the 
> following:
> * A means to automatically acquire/release appropriate locks for 
> {{HCatClient}} operations.
> * A component that maintains a lock heartbeat from the client (a sketch of 
> this idea follows the list).
> * A strategy for switching between manual/automatic lock management, 
> analogous to SQL's {{autocommit}} for transactions.
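> As a rough illustration of the heartbeat component idea, a client-side 
> helper might simply schedule periodic {{heartbeat}} calls while a lock is 
> held (the class and its parameters here are illustrative, not a proposed 
> design):
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> 
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> 
> /** Keeps a held lock alive by heartbeating it at a fixed interval. */
> public class LockHeartbeater implements AutoCloseable {
> 
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
> 
>   public LockHeartbeater(IMetaStoreClient client, long lockId, long periodSeconds) {
>     scheduler.scheduleAtFixedRate(() -> {
>       try {
>         // Lock-only heartbeat: no open transaction, so txnId = 0.
>         client.heartbeat(0, lockId);
>       } catch (Exception e) {
>         // Throwing here cancels further scheduled heartbeats; a real
>         // implementation should notify the caller so it can abort.
>         throw new RuntimeException("Failed to heartbeat lock " + lockId, e);
>       }
>     }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
>   }
> 
>   @Override
>   public void close() {
>     scheduler.shutdownNow();
>   }
> }
> {code}
> Such a component could be created when locks are acquired and closed when 
> they are released, whether lock management is manual or automatic.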
> An API for lock and heartbeat management already exists in the HCatalog 
> Mutation API (see: 
> {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}). It will likely 
> make sense to refactor this code and/or the code that uses it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
