Hi Robert,

The main reason that ThreadJobFactory and ProcessJobFactory are not considered "production-ready" is that the job gets only one container, and all tasks are assigned to that single container. Hence, it is not easy to scale out beyond a single host.
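
To make the limitation concrete, here is a minimal sketch in Python. It is not Samza code: the function name `assign_partitions` and the round-robin rule are assumptions chosen only for illustration. With one container, every partition lands in the same place; a static split across several processes gives each process its own subset.

```python
# Hypothetical illustration only; this is not Samza's implementation.
def assign_partitions(num_partitions, num_processes, process_id):
    """Return the partitions statically owned by one process (round-robin)."""
    if not 0 <= process_id < num_processes:
        raise ValueError("process_id must be in [0, num_processes)")
    return [p for p in range(num_partitions) if p % num_processes == process_id]


# A single container owns every partition, so the job cannot leave one host.
single = assign_partitions(8, 1, 0)

# Three statically configured processes, which could run on different hosts.
split = [assign_partitions(8, 3, i) for i in range(3)]
```

In the split case each partition is owned by exactly one process, so the processes can be started independently as long as each one is given a distinct `process_id`.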

As Rick mentioned, Netflix has put up a patch in SAMZA-41, based on 0.9.1, to allow static assignment of a subset of partitions to a single ProcessJob, which makes it possible to launch multiple ProcessJobs on different hosts. We planned to merge it into 0.10, but it turned out that too many changes had gone into 0.10 and the patch became difficult to merge.

At this point, we can still try the following two options:

1) We can attempt to merge SAMZA-41 into 0.10.1 again. It may take some effort but would give a stop-gap solution.
2) We are working on a standalone Samza model (SAMZA-516, SAMZA-881) to allow users to run Samza without depending on YarnJobFactory. This is a long-term effort and will take some time to flesh out. Please join the discussion there so that we can be more aligned in our effort.

Hope the above gives you an overall picture of where we are going.

Thanks a lot!

-Yi

On Wed, Mar 2, 2016 at 1:28 PM, Rick Mangi <r...@chartbeat.com> wrote:

> There was an interesting thread a while back from, I believe, the Netflix
> guys about running ThreadJobFactory in production.
>
> > On Mar 2, 2016, at 4:20 PM, Robert Crim <rjc...@gmail.com> wrote:
> >
> > Hi,
> >
> > We're currently working on a solution that allows us to run Samza jobs on
> > Mesos. This seems to be going well, and something we'd like to move away
> > from when native Mesos support is added to Samza.
> >
> > While we're developing and testing our scheduler, I'm wondering about the
> > implications of running tasks with the ThreadJobFactory in "production".
> > The documentation advises against this, but it's not clear why.
> >
> > If we were using the ThreadJobFactory inside of a docker container on
> > Mesos with Marathon for production, what would be our main problem? These
> > are not particularly high-load tasks. Aside from not being able to get
> > fine-grained resource scheduling per-task, it seems like the main issue
> > is not being able to easily tell when a job stops due to error / exception.
> >
> > In other words, what would be show-stopping reasons to not use the
> > ThreadJobFactory in production?
> >
> > Thanks,
> > Rob