Hi there! First off, thanks for the continued work on Samza -- I looked
into many DC/stream processors and Samza was a real standout with its
smart architecture and pluggable design.

I'm working on a custom StreamJob/Factory for running Samza jobs on
Kubernetes. Those two classes are pretty straight forward: I create a
Deployment in Kubernetes with the appropriate number of Pods (a number
<= the number of Kafka partitions I created the input topic with). Now
I'm moving onto what executes in the actual Docker containers and I'm a
bit confused.

My plan was to mirror as much as possible what the YarnJob does
which is setup an environment that will work with `run-jc.sh`. However,
I don't need ClusterBasedJobCoordinator because Kubernetes is not an
offer-based resource negotiator; if the JobCoordinator is running it
means, by definition, it received the appropriate resources. So a
PassThroughJobCoordinator with appropriate main() method seemed like the
ticket. Unfortunately, the PTJC doesn't actually seem to *do* anything
-- unlike the CBJC which has a run-loop and presumably schedules
containers and the like.

I saw the preview documentation on flexible deployment, but it didn't
totally click for me. Perhaps because it was also my first introduction
to the high-level API? (I've just been writing StreamTask impls)

Here's a brief description of the workflow I'm envisioning, perhaps
someone could tell me the classes I should implement and what sorts of
containers I might need running in the environment to coordinate
everything?

1. I create a topic in Kafka with N partitions
2. I start a job configured to run N-X containers
 2.1. If my topic has 4 partitions and I have low load, I might want
      X to start at 3 so I only have 1 task instance
3. Samza is configured to send all partitions to task instance 1
4. Later, load increases.
 4.1. I use Kubernetes to scale the job to 4 pods/containers
 4.2. Samza re-configures such that the new containers receive work

My intuition is that I need a JobCoordinator/Factory in the form of a
server that sets up Watches on the appropriate Kubernetes resources so
that when I perform the action in [4.1] *something* happens. Or perhaps
I should use ZkJobCoordinator? Presumably as pods/containers come and go
they will cause changes in ZK that will trigger Task restarts or
whatever logic the coordinator employs?

Okay, I'll stop rambling now. Thanks in advance for any tips!

- Tom

Reply via email to