On Tue, Mar 7, 2017 at 10:10 AM, Charan Reddy G <reddychara...@gmail.com> wrote:
> Hey, > > *Issue 1:* > > I was going through the code to understand how BookiesChanged event is > handled for AuditTask (in Auditor, BookkeeperAdmin and BookieWatcher > classes). In BookieWatcher.notifyBookiesChanged they are making > getChildren > call on "bookieRegistrationPath" and setting watcher (which listens for > NodeChildrenChanged event). Now as we know, Zookeeper watchers are one time > triggers, and hence in our code on consuming the watch event we reset > another watch to get notified of future changes. > > But due to the limitation of ZK (which is well documented thing in ZK > https://zookeeper.apache.org/doc/r3.4.5/zookeeperProgrammers.html#ch_ > zkWatches), "Because watches are one time triggers and there is latency > between getting the event and sending a new request to get a watch you > cannot reliably see every change that happens to a node in ZooKeeper. Be > prepared to handle the case where the znode changes multiple times between > getting the event and setting the watch again." > > So now in the implementation - (Auditor.notifyBookiesChanged, > Auditor.availableBookiesChanged, > BookKeeperAdmin.notifyBookiesChanged and BookieWatcher. > notifyBookiesChanged) > there is a possibility that if multiple bookies go off at a time, we might > miss handling BookiesChanged event in Auditor, which is a serious risk. > > Disussed with my teammember Kishore briefly, that to overcome this issue we > should compare return value of getChildren("bookieRegistrationPath") (in > BookieWatcher.notifyBookiesChanged) with previous known state to recognize > the changes, if any happenend, during the transition period. > > Am I missing something here? otherwise I think it is a serious flaw in the > implmentation. > > *Issue 2:* Also in Auditor class, if any exception happens in > notifyBookieChanges(), then we are just logging error and proceeding. > Which basically means that we may not successfully able to set watcher but > we are still proceeding. In this case aren't we missing BookiesChanged > event completely from then on? > If the zookeeper client is configured with retries, the getChildren will be retried with transient failures before throwing any exceptions. I believe the zookeeper client at bookie server is configured with infinite retries by default, no? > > @Override > public void availableBookiesChanged() { > // since a watch is triggered, we need to watch again on the > bookies > try { > notifyBookieChanges(); > } *catch (BKException bke) {* > * LOG.error("Exception while registering for a bookie change > notification", bke);* > * }* > submitAuditTask(); > } > > Thanks, > Charan >