Thanks so much for the answers guys, they are really helpful.

On Wed, Jun 17, 2015 at 1:57 AM, Bill Farner <wfar...@apache.org> wrote:

> Maxim's reply is correct, elaborating
>
> > Should it assume the Mesos list is complete, and assume the missing nodes
> > are indeed gone, and hence restart the jobs?
> >
>
> Yes.  This scenario is currently reconciled by the GC executor, which runs
> on an hourly interval by default.  This behavior is soon to be replaced by
> a newer process that should be able to provide greater responsiveness in
> this situation.
>

How expensive is the GC operation? Is it safe to run it more frequently
(for example, every 10 minutes)?

> > Is there any guarantee that multiple instances of the same job will not
> > be started?
> >
>
> Nope!  Aurora is designed to converge towards the desired number of
> instances of a job, but errs on the side of over-provisioning.  This tends
> to be the desired behavior in more cases than not.  Applications requiring
> an at-most instance count must implement that in the application layer,
> likely leaning on something like ZooKeeper or etcd.
>
> > If we had health checks, we could presumably use those to validate that the
> > job is, indeed, truly dead. Would that work?
> >
>
> Health checks would not change behavior in this scenario, as they are only
> used for node-local liveness monitoring.
>
> -=Bill
>
> On Tue, Jun 16, 2015 at 2:34 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Not sure I am getting the problem here. Are you observing Mesos
> > master, Aurora leader or a native log quorum loss?
> >
> > To your questions, every part of the Aurora/Mesos system is designed
> > in a failure-tolerant manner. A loss of Mesos master, Aurora leader or
> > a Mesos slave should not cause any irrecoverable data loss. All
> > efforts are made to ensure tasks are restarted to compensate for any
> > lost instances. There should be no duplicate jobs but there could be
> > duplicate task instances for some time until Aurora/Mesos reconcile
> > their state (usually within 1 hour).
> >
> > As for job health monitoring, I'd recommend exporting and alerting on
> > job stats (similar to scheduler stats exposed via /vars endpoint).
> >
> > Thanks,
> > Maxim
> >
> > On Tue, Jun 16, 2015 at 2:19 PM, Mauricio Garavaglia
> > <mauriciogaravag...@gmail.com> wrote:
> > > Hello!
> > >
> > > We had an issue with our Aurora/Mesos cluster that made it lose quorum,
> > > and we are wondering how the recovery of lost jobs works. What happened
> > > is basically:
> > >
> > > #1 Start Aurora job, and have it allocated to node A.
> > > #2 Aurora Schedulers, Mesos Master and ZK stopped
> > > #3 node A stopped
> > > #4 Aurora Schedulers, Mesos Master and ZK started again
> > >
> > > Should it assume the Mesos list is complete, and assume the missing nodes
> > > are indeed gone, and hence restart the jobs? Is there any guarantee that
> > > multiple instances of the same job will not be started?
> > >
> > > If we had health checks, we could presumably use those to validate that
> > > the job is, indeed, truly dead. Would that work?
> > >
> > > Thanks!
> > >
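
For readers following the thread, the "converge, over-provision, then reconcile" behavior described in the replies can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyScheduler`, `mark_lost`, `converge` and `reconcile` are hypothetical names invented for illustration, not Aurora or Mesos APIs, and the real scheduler and GC executor are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model of a converging scheduler. Hypothetical; not Aurora's code."""

    desired: int                                          # desired instance count
    believed_running: set = field(default_factory=set)   # tasks the scheduler thinks are alive
    next_id: int = 0

    def launch(self) -> str:
        """Start a new task and record it as running."""
        task_id = f"task-{self.next_id}"
        self.next_id += 1
        self.believed_running.add(task_id)
        return task_id

    def mark_lost(self, task_id: str) -> None:
        """Forget a task whose node became unreachable (it may still be alive)."""
        self.believed_running.discard(task_id)

    def converge(self) -> list:
        """Launch replacements until the believed count reaches the desired count."""
        launched = []
        while len(self.believed_running) < self.desired:
            launched.append(self.launch())
        return launched

    def reconcile(self, actually_running: set) -> set:
        """Periodic pass: return tasks that are alive but unknown, so they can be killed."""
        actual = set(actually_running)
        surplus = actual - self.believed_running
        self.believed_running &= actual
        return surplus


# Scenario loosely following steps #1 to #4 in the original question.
sched = ToyScheduler(desired=1)
cluster = set(sched.converge())        # job starts on node A
sched.mark_lost("task-0")              # node A looks gone after the restart
cluster |= set(sched.converge())       # a replacement is launched
duplicates_before = len(cluster)       # old task may still be alive: over-provisioned
cluster -= sched.reconcile(cluster)    # periodic reconciliation removes the surplus
print(duplicates_before, sorted(cluster))
```

The window between `converge` and `reconcile` is where duplicate task instances can exist, which is why an application that needs an at-most instance count has to enforce it itself, for example with a lock held in ZooKeeper or etcd as suggested above.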