> > ... >
> Most of the responsive nature of Juju is driven off the watchers. These
> watchers watch the mongo oplog for document changes. What happened was
> that there were so many mongo operations, the capped collection of the
> oplog was completely replaced between our polled watcher delays. The
> watchers then errored out in a new, unexpected way.
>
> Effectively the watcher infrastructure needs an internal reset button
> that it can hit when this happens, one that invalidates all the watchers.
> This should cause all the workers to be torn down and restarted from a
> known good state.

Tim and I discussed this a bit.

It probably wasn't the 'oplog' that overflowed, but actually the 'txns.log'
collection, which is also a capped collection, 10MB in size.

The issue is likely that the 'txnsLogWorker' automatically restarted on an
error, but the error actually meant that we're missing events, which means
that all the watchers/workers relying on the event stream should be
restarted. (We obviously can't know what events we're missing, because
they're missing.)

So one argument is that the txnsLogWorker should *not* be automatically
restarted. Instead, failures of that worker should be treated as critical
failures of the process and just cause the whole process to restart. The
alternative is that we introduce a mechanism to cause all workers to
restart (since they need to start fresh anyway), but restarting the agent
has a similar effect. It is possible that we could allow some known errors
that don't indicate we need a full restart, but those really should be an
explicit whitelist. (Rough sketches of both the failure mode and the
whitelist idea are at the bottom of this mail, after the quoted bits.)

John =:->

> There was a model that got stuck being destroyed; this was tracked back
> to a worker that should be doing the destruction not noticing.
>
> All the CPU usage can be tracked back to the 139 models in the apiserver
> state pool, each still running leadership and base watcher workers. The
> state pool should have removed all these instances, but it didn't notice
> they were gone.
>
> There are some other bugs around logging things as errors that really
> aren't errors, which contributed to log noise, but the fundamental error
> here is not being robust in the face of too much change at once.
>
> This needs to be fixed for the 2.2 release candidate, so it may well push
> that out past the end of this week.
>
> Tim
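To make the failure mode concrete, here is a rough sketch in plain Go. It is
not the actual Juju watcher code (cappedLog, ErrMissedEvents and the field
names are all invented for illustration); it just shows why a polled watcher
over a capped collection has to treat "my last-seen position has been
evicted" as a distinct, unrecoverable error instead of quietly carrying on:

    // Illustrative only: a polling watcher over a bounded ("capped") log,
    // standing in for txns.log. If the oldest retained entry is newer than
    // the last entry we processed, the window we needed has been
    // overwritten and events were lost; the only safe response is to
    // report a distinct error rather than resume as if nothing happened.
    package main

    import (
    	"errors"
    	"fmt"
    )

    // ErrMissedEvents signals that the capped log wrapped past our position.
    var ErrMissedEvents = errors.New("capped log overflowed; events were lost")

    // cappedLog keeps only the most recent max entries.
    type cappedLog struct {
    	max     int
    	entries []int // entry ids, ascending
    }

    func (c *cappedLog) add(id int) {
    	c.entries = append(c.entries, id)
    	if len(c.entries) > c.max {
    		c.entries = c.entries[len(c.entries)-c.max:]
    	}
    }

    // since returns entries newer than lastSeen, or ErrMissedEvents if the
    // entry after lastSeen has already been evicted (so we cannot know
    // what we missed).
    func (c *cappedLog) since(lastSeen int) ([]int, error) {
    	if len(c.entries) > 0 && c.entries[0] > lastSeen+1 {
    		return nil, ErrMissedEvents
    	}
    	var out []int
    	for _, id := range c.entries {
    		if id > lastSeen {
    			out = append(out, id)
    		}
    	}
    	return out, nil
    }

    func main() {
    	log := &cappedLog{max: 3}
    	for id := 1; id <= 10; id++ {
    		log.add(id)
    	}
    	// We only polled once, long after 10 entries landed in a 3-entry
    	// log, so our position (0) is long gone.
    	if _, err := log.since(0); errors.Is(err, ErrMissedEvents) {
    		fmt.Println("watcher must be invalidated:", err)
    	}
    }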
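And a similarly rough sketch of the restart policy argued for above: unknown
errors from the txn-log worker are fatal to the whole process, and only an
explicit whitelist of errors gets the worker quietly restarted. Again, the
names here (supervise, errSessionReset, errLogOverflow) are invented for the
sketch, not taken from the Juju source:

    // Illustrative only: whitelist-based restart policy for a worker whose
    // failure can mean "we missed events". Anything not on the whitelist
    // terminates the whole process so every dependent watcher/worker comes
    // back from a known good state.
    package main

    import (
    	"errors"
    	"fmt"
    	"os"
    	"time"
    )

    var (
    	errSessionReset = errors.New("session reset")     // hypothetical recoverable error
    	errLogOverflow  = errors.New("txn log overflowed") // hypothetical fatal error
    )

    // restartable lists errors known NOT to imply missed events, so the
    // worker alone may be restarted.
    var restartable = []error{
    	errSessionReset,
    }

    func isRestartable(err error) bool {
    	for _, allowed := range restartable {
    		if errors.Is(err, allowed) {
    			return true
    		}
    	}
    	return false
    }

    // supervise runs the worker; whitelisted errors restart just the
    // worker, anything else restarts the agent.
    func supervise(work func() error) {
    	for {
    		err := work()
    		if err == nil {
    			return
    		}
    		if isRestartable(err) {
    			fmt.Println("restarting worker after:", err)
    			time.Sleep(time.Second)
    			continue
    		}
    		fmt.Println("fatal worker error, restarting agent:", err)
    		os.Exit(1)
    	}
    }

    func main() {
    	supervise(func() error { return errLogOverflow })
    }

The point being: when in doubt, dying loudly and letting the agent come back
from a known good state is safer than a worker silently resuming with a gap
in its event stream.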
--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev