Re: Lost jobs on cluster failure

Mauricio Garavaglia Wed, 17 Jun 2015 07:30:55 -0700

Thanks so much for the answers, guys; they are really helpful.

On Wed, Jun 17, 2015 at 1:57 AM, Bill Farner <wfar...@apache.org> wrote:


> Maxim's reply is correct; elaborating:
>
> > Should it assume the Mesos list is complete, and assume the missing nodes
> > are indeed gone, and hence restart the jobs?
>
>
> Yes.  This scenario is currently reconciled by the GC executor, which runs
> on an hourly interval by default.  This behavior is soon to be replaced by
> a newer process that should be able to provide greater responsiveness in
> this situation.
>
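
To make the reconciliation idea concrete, here is a minimal sketch of the general technique: compare what the scheduler believes is running against what the cluster actually reports. This is not Aurora's or the GC executor's actual code; the `reconcile` function, the `Reconciliation` type, and the task IDs are all hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reconciliation:
    lost: frozenset     # scheduler believes running, cluster does not report
    unknown: frozenset  # cluster reports, scheduler has no record of it


def reconcile(expected, reported):
    """Diff the scheduler's view of running tasks against the cluster's view."""
    expected, reported = frozenset(expected), frozenset(reported)
    return Reconciliation(lost=expected - reported, unknown=reported - expected)


# Hypothetical task IDs: node A disappeared while the scheduler was down.
scheduler_view = {"job/0@nodeA", "job/1@nodeB"}
cluster_view = {"job/1@nodeB", "orphan/0@nodeC"}
result = reconcile(scheduler_view, cluster_view)
```

In a scheme like this, tasks in `lost` would be rescheduled and tasks in `unknown` would be killed; how often the diff runs determines how long the two views can disagree.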

How expensive is the GC operation? Is it safe to run it more
frequently (say, every 10 minutes)?



> > Is there any guarantee that multiple instances of the same job will not be
> > started?
>
>
> Nope!  Aurora is designed to converge towards the desired number of
> instances of a job, but errs on the side of over-provisioning.  This tends
> to be the desired behavior in more cases than not.  Applications requiring
> an at-most instance count must implement that in the application layer,
> likely leaning on something like ZooKeeper or etcd.
>
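
For the application-layer guard, a rough sketch of what an at-most-one lease could look like. The `LeaseStore` class below is an in-memory stand-in I made up for illustration, not the ZooKeeper or etcd API; in a real deployment the lease would presumably be a ZooKeeper ephemeral node or an etcd key attached to a lease, and `now` would come from a clock rather than being passed in.

```python
class LeaseStore:
    """In-memory stand-in for a coordination service (hypothetical API)."""

    def __init__(self):
        self._leases = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl, now):
        """Grant or renew the lease if it is free, expired, or already ours."""
        holder = self._leases.get(key)
        if holder is None or holder[1] <= now or holder[0] == owner:
            self._leases[key] = (owner, now + ttl)
            return True
        return False


store = LeaseStore()
# The first instance takes the lease; a duplicate started during
# reconciliation is refused until the lease expires.
first = store.try_acquire("job-lock", "instance-1", ttl=30, now=0)
duplicate = store.try_acquire("job-lock", "instance-2", ttl=30, now=10)
takeover = store.try_acquire("job-lock", "instance-2", ttl=30, now=31)
```

Each instance would do real work only while it holds the lease, and stop as soon as it fails to renew it, so a duplicate instance stays idle instead of doing the same work twice.
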
> > If we had health checks, we could presumably use those to validate that the
> > job is, indeed, truly dead. Would that work?
>
>
> Health checks would not change behavior in this scenario, as they are only
> used for node-local liveness monitoring.
>
> -=Bill
>
> On Tue, Jun 16, 2015 at 2:34 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Not sure I am getting the problem here. Are you observing Mesos
> > master, Aurora leader or a native log quorum loss?
> >
> > To your questions, every part of the Aurora/Mesos system is designed
> > in a failure-tolerant manner. A loss of Mesos master, Aurora leader or
> > a Mesos slave should not cause any irrecoverable data loss. All
> > efforts are made to ensure tasks are restarted to compensate for any
> > lost instances. There should be no duplicate jobs but there could be
> > duplicate task instances for some time until Aurora/Mesos reconcile
> > their state (usually within 1 hour).
> >
> > As for job health monitoring, I'd recommend exporting and alerting on
> > job stats (similar to scheduler stats exposed via /vars endpoint).
> >
> > Thanks,
> > Maxim
> >
> > On Tue, Jun 16, 2015 at 2:19 PM, Mauricio Garavaglia
> > <mauriciogaravag...@gmail.com> wrote:
> > > Hello!
> > >
> > > We had an issue with our Aurora/Mesos cluster that made it lose quorum,
> > > and we are wondering how the recovery of lost jobs works. What happened
> > > was basically:
> > >
> > > #1 Start Aurora job, and have it allocated to node A.
> > > #2 Aurora Schedulers, Mesos Master and ZK stopped
> > > #3 node A stopped
> > > #4 Aurora Schedulers, Mesos Master and ZK started again
> > >
> > > Should it assume the Mesos list is complete, and assume the missing nodes
> > > are indeed gone, and hence restart the jobs? Is there any guarantee that
> > > multiple instances of the same job will not be started?
> > >
> > > If we had health checks, we could presumably use those to validate that
> > > the job is, indeed, truly dead. Would that work?
> > >
> > > Thanks!
> >
>
