[jira] [Commented] (HIVE-14979) Removing stale Zookeeper locks at HiveServer2 initialization

Peter Vary (JIRA) Thu, 20 Oct 2016 05:19:39 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591664#comment-15591664
 ]


Peter Vary commented on HIVE-14979:
-----------------------------------

I totally agree with you [~sershe]!

Here is what I know at the moment, thanks to everyone who helped out with 
extra info:
- There is one configuration value for the ZooKeeper timeout 
(HIVE_ZOOKEEPER_SESSION_TIMEOUT), used by both the service discovery and the 
locks. It is set to 20 minutes by default, and might be overridden with a 
lower value by the ZooKeeper maxSessionTimeout setting.
- If HiveServer2 is shut down normally, it removes the ZooKeeper nodes as 
expected (at least I have yet to find an example that contradicts this).
- If HiveServer2 dies unexpectedly, ZooKeeper correctly removes the ephemeral 
nodes, but only after the session timeout is reached, which could take 20 
minutes with the default configuration.
- The patch proposes a configuration option that, if enabled, removes the 
remaining ZooKeeper lock nodes at HiveServer2 startup even if the ZooKeeper 
session timeout has not yet been reached.
- So far I have read one quite good reason for the large timeout (see the 
comment by [~thejas], and 
http://stackoverflow.com/questions/14275613/concerns-about-zookeepers-lock-recipe).
 The session timeout relies on ping messages, so a long GC pause or network 
congestion could cause session termination. ZooKeeper tries to ping an idle 
connection after 1/3 of the timeout, so the longer the timeout, the less 
likely it is that a session gets terminated overzealously :).
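
To make the timeout points above concrete, here is a minimal sketch of the arithmetic. It is not Hive or ZooKeeper code: the class and method names are hypothetical, and the server-side values (tickTime of 2000 ms, minSessionTimeout of 2 x tickTime, maxSessionTimeout of 20 x tickTime) are assumed stock ZooKeeper defaults that a real cluster may well override. The 1/3 ping rule is taken from the bullet above.

```java
// Sketch only: models how a requested session timeout is clamped by the
// server and when an idle connection would be pinged. Names are hypothetical.
final class ZkSessionTimeoutSketch {

    // Clamp the client's requested timeout into the server's [min, max] window.
    static long negotiatedTimeoutMs(long requestedMs, long minMs, long maxMs) {
        return Math.max(minMs, Math.min(requestedMs, maxMs));
    }

    // Per the bullet above: an idle connection is pinged after 1/3 of the timeout.
    static long idlePingAfterMs(long negotiatedMs) {
        return negotiatedMs / 3;
    }

    public static void main(String[] args) {
        long requested = 20L * 60 * 1000; // 20 minutes, the Hive default cited above
        long tickTime = 2000;             // assumed ZooKeeper server tickTime (ms)
        long min = 2 * tickTime;          // assumed default minSessionTimeout
        long max = 20 * tickTime;         // assumed default maxSessionTimeout

        long effective = negotiatedTimeoutMs(requested, min, max);
        System.out.println("effective timeout ms: " + effective);
        System.out.println("idle ping after ms:   " + idlePingAfterMs(effective));

        // If the server maximum is raised to the requested value, nothing is clamped.
        long raised = negotiatedTimeoutMs(requested, min, requested);
        System.out.println("with raised max ms:   " + raised);
    }
}
```

Under these assumed server defaults the 20-minute request would be cut to 40 seconds, so how long a crashed HiveServer2's ephemeral nodes linger depends on the ZooKeeper server settings as much as on the Hive value.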

I do not know enough about the external jobs yet, but I also think the 
remaining jobs could be a problem. All in all, addressing them with an 
increased timeout does not strike me as a good solution: queries in Hive could 
be huge and could run for hours or days, so a 20-minute timeout still does not 
solve the problem at all. Am I right here, or am I missing some important 
points?

Thanks,
Peter

> Removing stale Zookeeper locks at HiveServer2 initialization
> ------------------------------------------------------------
>
>                 Key: HIVE-14979
>                 URL: https://issues.apache.org/jira/browse/HIVE-14979
>             Project: Hive
>          Issue Type: Improvement
>          Components: Locking
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-14979.3.patch, HIVE-14979.4.patch, HIVE-14979.patch
>
>
> HiveServer2 can use Zookeeper to store tokens that indicate that particular 
> tables are locked, by creating persistent Zookeeper objects. 
> A problem can occur when a HiveServer2 instance creates a lock on a table and 
> then crashes ("Out of Memory" for example), so the locks are not released in 
> Zookeeper. Such a lock will then remain until it is manually cleared by an 
> admin.
> There should be a way to remove stale locks at HiveServer2 initialization, 
> making the admins' life easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
