Ah, then yea, checkpointing should solve your problem. Let's do that.

On Wed, May 24, 2017 at 11:19 AM, Charles Allen <charles.al...@metamarkets.com> wrote:
> The issue on our side is we tend to roll out a bunch of agent updates at
> about the same time. So rolling an agent, then waiting for spark jobs to
> recover, then rolling another agent is not at all practical. It is a huge
> benefit if we can just update the agents in bulk (or even sequentially, but
> only waiting for the mesos agent to recover).
>
> On Wed, May 24, 2017 at 11:17 AM Michael Gummelt <mgumm...@mesosphere.io>
> wrote:
>
>> > We had investigated internally recently why restarting the mesos
>> > agents failed the spark jobs (no real reason they should, right?) and came
>> > across the data.
>>
>> Restarting the agent without checkpointing enabled will kill the
>> executor, but that still shouldn't cause the Spark job to fail, since Spark
>> jobs should tolerate executor failures.
>>
>> On Mon, Apr 3, 2017 at 2:26 PM, Timothy Chen <tnac...@gmail.com> wrote:
>>
>>> Yes, adding the timeout config should be the only code change required.
>>>
>>> And just to clarify, this is for reconnecting with Mesos master (not
>>> agents) after failover.
>>>
>>> Tim
>>>
>>> On Mon, Apr 3, 2017 at 2:23 PM, Charles Allen
>>> <charles.al...@metamarkets.com> wrote:
>>> > We had investigated internally recently why restarting the mesos agents
>>> > failed the spark jobs (no real reason they should, right?) and came across
>>> > the data. The other conversation by Yu sparked trying to poke to get some of
>>> > the tickets updated to spread around any tribal knowledge that is floating
>>> > in the community.
>>> >
>>> > It sounds like the only thing keeping it from being enabled is a timeout
>>> > config and someone volunteering to do some testing?
>>> >
>>> >
>>> > On Mon, Apr 3, 2017 at 2:19 PM Timothy Chen <tnac...@gmail.com> wrote:
>>> >>
>>> >> The only reason is that MesosClusterScheduler by design is long
>>> >> running so we really needed it to have failover configured correctly.
>>> >>
>>> >> I wanted to create a JIRA ticket to allow users to configure it for
>>> >> each Spark framework, but just didn't remember to do so.
>>> >>
>>> >> Per another question that came up in the mailing list, I believe we
>>> >> should add it as it's a fairly straightforward effort.
>>> >>
>>> >> Tim
>>> >>
>>> >> On Mon, Apr 3, 2017 at 2:16 PM, Charles Allen
>>> >> <charles.al...@metamarkets.com> wrote:
>>> >> > As per https://issues.apache.org/jira/browse/SPARK-4899
>>> >> >
>>> >> > org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver
>>> >> > allows checkpointing, but only
>>> >> > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler uses it.
>>> >> > Is there a reason for that?
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere

--
Michael Gummelt
Software Engineer
Mesosphere
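
For anyone skimming the thread, here is a rough sketch of the change being discussed: letting each Spark framework opt in to checkpointing, paired with a failover timeout. It is a plain-Python model, not Spark or Mesos code. `checkpoint` and `failover_timeout` mirror the fields of the same name on the Mesos `FrameworkInfo` message. The two config key names, the `FrameworkSettings` class, and both helper functions are assumptions made up for this illustration, since the thread does not name them. The survival check encodes only the claim made above (no checkpointing means an agent restart kills the executor) and ignores everything else that can go wrong during agent recovery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameworkSettings:
    """Hypothetical holder for the two FrameworkInfo fields discussed above."""
    checkpoint: bool         # agent checkpoints framework/executor state to disk
    failover_timeout: float  # seconds the master waits for the framework to reconnect


def framework_settings(conf):
    """Build settings from a dict of string-valued config entries.

    Both key names are placeholders assumed for this sketch.
    """
    raw_checkpoint = conf.get("spark.mesos.checkpoint", "false")
    raw_timeout = conf.get("spark.mesos.driver.failoverTimeout", "0")

    checkpoint = raw_checkpoint.strip().lower() == "true"
    timeout = float(raw_timeout)
    if timeout < 0:
        raise ValueError("failover timeout must be non-negative")
    return FrameworkSettings(checkpoint=checkpoint, failover_timeout=timeout)


def executor_survives_agent_restart(settings):
    """Simplified rule from the thread: without checkpointing, restarting
    the agent kills the executor; with it, the executor keeps running."""
    return settings.checkpoint
```

A quick usage example, comparing the default with an opted-in framework:

```python
default = framework_settings({})
opted_in = framework_settings({
    "spark.mesos.checkpoint": "true",
    "spark.mesos.driver.failoverTimeout": "600",
})
print(executor_survives_agent_restart(default))
print(executor_survives_agent_restart(opted_in))
```

Keeping both values off by default preserves today's behaviour for jobs that do not ask for checkpointing.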