Hi, Rick and John,

Thanks for the great discussion! As Jacob said, we have also recognized the possible drawbacks of relying solely on YARN for process liveness detection, which is why SAMZA-871 was opened. Please comment on the JIRA so that we can track the discussion and move the design process forward.
Thanks a lot!

-Yi

On Wed, Feb 10, 2016 at 2:10 PM, Rick Mangi <r...@chartbeat.com> wrote:

> Jake, Not my question, I was just adding my 2 cents :)
>
> John, it’s not that yarn is responsible for maintaining 1 instance of each
> container. Samza has an abstract management layer that defers this to yarn,
> but some people bypass yarn altogether and manage their containers
> themselves or run on things like mesos.
>
> For your purposes though, if you are using yarn, then yes this is yarn’s
> job.
>
> The case I ran into was with cloudera’s distro of yarn with an older
> version of ubuntu and yarn. I haven’t seen zombies since moving to the
> latest yarn distro.
>
> > On Feb 10, 2016, at 4:44 PM, Jacob Maes <jacob.m...@gmail.com> wrote:
> >
> > Hey Rick,
> >
> > If I understand your question, the goal is really to make sure there are
> > no orphaned containers that continue to run "off the books".
> >
> > The newly added SAMZA-871 describes a heartbeat mechanism to make sure
> > orphaned containers actually get killed.
> >
> > Also, the YARN Node Manager Restart capability might help. We're in the
> > process of testing this at LinkedIn:
> > https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html
> >
> > -Jake
> >
> > On Wed, Feb 10, 2016 at 1:42 PM, John Dennison <dennison.j...@gmail.com>
> > wrote:
> >
> >> To second Rick's point: it's less about malicious actors and more about
> >> containers thought to be lost due to a network partition popping up later
> >> and starting to write to the changelog. I assume from Rick's response that
> >> yarn is responsible for ensuring only one version of each container is
> >> running and samza has nothing internal to deal with this.
> >>
> >> I guess you could hijack kafka's auth framework to block old zombie
> >> containers from writing. Use some global lock's incrementing token as the
> >> password. A zombie process would auth with an old token and be denied. I
> >> haven't looked, but I imagine that the 0.9.0 auth framework isn't done on
> >> a partition level.
> >>
> >> On Wed, Feb 10, 2016 at 2:27 PM, Rick Mangi <r...@chartbeat.com> wrote:
> >>
> >>> Security wouldn’t stop zombie processes from writing to kafka. I had this
> >>> problem with yarn before where the container thought it was killing jobs
> >>> but they never actually died, and in fact continued to write to kafka.
> >>>
> >>>> On Feb 10, 2016, at 4:23 PM, Jagadish Venkatraman <jagadish1...@gmail.com> wrote:
> >>>>
> >>>> Hi John
> >>>>
> >>>> Currently there is no authorization on who writes to Kafka. There is a
> >>>> Kafka security proposal that the kafka community is working on.
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/Security
> >>>>
> >>>> Building this into Samza may entail expensive coordination (to prevent
> >>>> other jobs). Since jobs are usually run in a trusted environment, I've
> >>>> not seen people requesting this use-case. Even if we did build this into
> >>>> Samza, nothing stops people from writing to that Kafka topic by bypassing
> >>>> Samza completely (through the kafka producer or an external library).
> >>>>
> >>>> I'd think Kafka would build support for authorization, principals, roles
> >>>> etc. in the future and Samza can leverage it once it's done.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> On Wednesday, February 10, 2016, John Dennison <dennison.j...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Greetings,
> >>>>>
> >>>>> I have a general design question I did not see addressed in the docs.
> >>>>> Basically, how does samza guarantee a single writer for each changelog
> >>>>> partition? Because of the strong ordering assumption of these changelogs,
> >>>>> how do you protect against zombie processes writing to the changelog with
> >>>>> out-of-date values?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> John
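
For reference, the token idea John floats in the thread (a global lock hands out an incrementing token, and a writer presenting an old token is denied) might look roughly like the sketch below. This is only an in-memory illustration under stated assumptions: `TokenIssuer`, `FencedChangelog`, and `StaleWriterError` are hypothetical names, not Samza or Kafka APIs, and nothing in the thread says Kafka's auth framework can enforce this per partition.

```python
import threading


class StaleWriterError(Exception):
    """Raised when a writer presents a token older than the newest one seen."""


class TokenIssuer:
    """Hypothetical stand-in for the 'global lock': issues increasing tokens."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}  # partition -> most recently issued token

    def acquire(self, partition):
        # Each (re)started container asks for a fresh, strictly larger token.
        with self._lock:
            token = self._latest.get(partition, 0) + 1
            self._latest[partition] = token
            return token


class FencedChangelog:
    """Hypothetical changelog that checks the token on every append."""

    def __init__(self):
        self._highest_seen = {}  # partition -> highest token accepted so far
        self.records = {}        # partition -> list of (key, value)

    def append(self, partition, token, key, value):
        highest = self._highest_seen.get(partition, 0)
        if token < highest:
            # A newer writer has already written: treat this one as a zombie.
            raise StaleWriterError(
                f"token {token} is older than {highest} for partition {partition}"
            )
        self._highest_seen[partition] = token
        self.records.setdefault(partition, []).append((key, value))


issuer = TokenIssuer()
log = FencedChangelog()

old_token = issuer.acquire(0)          # original container for partition 0
log.append(0, old_token, "k", "v1")

new_token = issuer.acquire(0)          # replacement started after presumed loss
log.append(0, new_token, "k", "v2")

try:
    log.append(0, old_token, "k", "stale")  # the zombie wakes up and writes
    zombie_rejected = False
except StaleWriterError:
    zombie_rejected = True
```

Note that the check has to live on the side that accepts writes (the broker, in John's suggestion). A zombie cannot be trusted to fence itself, which is also why a heartbeat alone, as proposed in SAMZA-871, addresses killing orphans rather than rejecting their writes.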