Hi Abhishek,

Heartbeat between the AM and container has been a long awaited Samza
feature. It will go a long way in ensuring our reliability! +1 for this SEP.

*High level comments:*

Currently, the only use-case for the heartbeat mechanism seems to be when
running Samza on Yarn. IMHO, it makes sense to pull the heart beat logic
into the *LocalContainerRunner* instead of baking it into the
*SamzaContainer* class. Long term, we can re-visit this when we have a
pluggable liveness detection mechanism.

I'm thinking of a flow like this:

There is a separate component (or a thread) inside LocalContainerRunner
that periodically polls the coordinator, and determines if it should
continue running. If the coordinator determines that the container should
not run, the *LocalContainerRunner* cleanly shuts-down the container and
the process exits with a non-zero exit status.

The following nice properties fall out:

   - We can remove the proposed config *job.container.validator.enabled. *
   - We can also remove the proposed *Killable* interface since
   *SamzaContainer* (and runLoops) don't have to implement *Killable *
   anymore. The life-cycle is managed by the *LocalContainerRunner* that
   started it.

*On the proposed public interfaces:*

*job.container.validator.enabled:  *I am not in favor of adding this as a
new public config. IIUC, When running Samza jobs on Yarn, we always want
the validator/heartbeats to be enabled. OTOH, when running Samza jobs in
standalone mode, we currently do not have a pluggable mechanism for
heartbeat.

*job.container.schedule.ms <http://job.container.schedule.ms>: *It does
seem that we can pick a sensible default, and be done with it (instead of
adding a new config)? Is there a reason this needs to be configurable?

*On proposed Killable interface: *

Not entirely sure we need this new "*Killable"* interface (esp. given that
there's currently only one implementation - *SamzaContainer*).

   - The *LocalContainerRunner* can instead directly invoke shut-down on
   the *SamzaContainer* when its heart-beat expires. The extra level of
   indirection (making *SamzaContainer* to implement *Killable*) is
   probably unnecessary IMHO.


   - Since, the *LocalContainerRunner* invokes *start/run* on the
   *SamzaContainer*, it seems simpler also have it invoke *shutdown* on the
   *SamzaContainer. *

*Minor Comments:*

>> Expose a REST endpoint (eg: /isContainerValid) who's purpose is to get
requests from the Samza container periodically and respond back weather the
container is in the Job Coordinator's current list of valid containers.

Wondering if it'd be slightly cleaner to rename this to */heartBeatRequest*
and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
*isContainerValid
* and the definition of validity does seem slightly broad?

Thanks again for taking the time to draft the SEP, and volunteering to
implement this. Nice work!

Best,
Jagadish

On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <abks...@gmail.com>
wrote:

> Hi Everyone,
>
> In order to fix the issue of orphaned/leaky containers seen when the
> YARN Node Manager crashes, I have created a SEP discussing the design for
> implementing a heartbeat between the containers and the job coordinator:
> https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> all+running+containers
>
> Please take a look and provide feedback. I would also really appreciate
> help in designing a way to propagate the error up from SamzaContainer in
> order to exit the container with a non-zero exit code.
>
> Thanks,
> Abhishek
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Reply via email to