Roger,

You are welcome. If you want to experiment, you can use my hello-samza
<https://hub.docker.com/r/elevy/hello-samza/> Docker image.

On Sun, Nov 29, 2015 at 12:19 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Elias,
>
> I would also love to be able to deploy Samza on Kubernetes with dynamic
> task management. Thanks for sharing this. It may be a good interim
> solution.
>
> Roger
>
> On Sun, Nov 29, 2015 at 11:18 AM, Elias Levy <fearsome.lucid...@gmail.com>
> wrote:
>
> > I've been exploring Samza for stream processing as well as Kubernetes
> > as a container orchestration system, and I wanted to be able to use
> > one with the other. The prospect of having to execute YARN either
> > alongside or on top of Kubernetes did not appeal to me, so I developed
> > a KubernetesJob implementation of SamzaJob.
> >
> > You can find the details at
> > https://github.com/eliaslevy/samza_kubernetes, but in summary
> > KubernetesJob executes and generates a serialized JobModel. Instead of
> > interacting with Kubernetes directly to create the SamzaContainers (as
> > the YarnJob's SamzaApplicationMaster may do with the YARN RM), it
> > outputs a config YAML file that can be used to create the
> > SamzaContainers in Kubernetes by using Replication Controllers. For
> > this you need to package your job as a Docker image. See the README at
> > the above repo for details.
> >
> > A few observations:
> >
> > It would be useful if SamzaContainer accepted the JobModel via an
> > environment variable. Right now it expects a URL to download it from.
> > I get around this by using an entry point script that copies the model
> > from an environment variable into a file, then passes a file URL to
> > SamzaContainer.
> >
> > SamzaContainer doesn't allow you to configure the JMX port. It selects
> > a port at random from the ephemeral range, as it expects to execute in
> > YARN, where a static port could result in a conflict. This is not the
> > case in Kubernetes, where each Pod (i.e. SamzaContainer) is given its
> > own IP address.
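
The entry point workaround mentioned in the observations above (copy the
JobModel from an environment variable into a file, then hand SamzaContainer
a file URL) could look roughly like the sketch below. It is not the script
from the repo. The variable names SAMZA_JOB_MODEL and SAMZA_COORDINATOR_URL,
the file name job-model.json, and the idea of passing the container launch
command as arguments are all assumptions made for illustration.

```python
import os
import sys
import tempfile
from pathlib import Path


def materialize_job_model(env=None, var="SAMZA_JOB_MODEL", directory=None):
    """Write the JobModel held in env[var] to a file and return a file:// URL.

    The variable name is an assumption, not something Samza defines.
    """
    env = os.environ if env is None else env
    model = env.get(var)
    if not model:
        raise RuntimeError(f"environment variable {var} is empty or unset")
    directory = Path(directory or tempfile.mkdtemp(prefix="samza-"))
    path = directory / "job-model.json"  # hypothetical file name
    path.write_text(model, encoding="utf-8")
    # as_uri() needs an absolute path, hence resolve()
    return path.resolve().as_uri()


def main(argv):
    url = materialize_job_model()
    # Assumption: the container picks up the model location from this variable.
    child_env = dict(os.environ, SAMZA_COORDINATOR_URL=url)
    # argv[1:] is whatever command starts the container in your image.
    os.execvpe(argv[1], argv[1:], child_env)


# Only exec when a launch command was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

In a Docker image this would be the ENTRYPOINT, with the real container
start command supplied as its arguments.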
> >
> > This implementation doesn't provide a Samza dashboard, which in the
> > YARN implementation is hosted in the Application Master. There didn't
> > seem to be much value provided by the dashboard that is not already
> > provided by the Kubernetes tools for monitoring pods.
> >
> > I've successfully executed the hello-samza jobs in Kubernetes:
> >
> > $ kubectl get po
> > NAME                       READY   STATUS    RESTARTS   AGE
> > kafka-1-jjh8n              1/1     Running   0          2d
> > kafka-2-buycp              1/1     Running   0          2d
> > kafka-3-tghkp              1/1     Running   0          2d
> > wikipedia-feed-0-4its2     1/1     Running   0          1d
> > wikipedia-parser-0-l0onv   1/1     Running   0          17h
> > wikipedia-parser-1-crrxh   1/1     Running   0          17h
> > wikipedia-parser-2-1c5nn   1/1     Running   0          17h
> > wikipedia-stats-0-3gaiu    1/1     Running   0          16h
> > wikipedia-stats-1-j5qlk    1/1     Running   0          16h
> > wikipedia-stats-2-2laos    1/1     Running   0          16h
> > zookeeper-1-1sb4a          1/1     Running   0          2d
> > zookeeper-2-dndk7          1/1     Running   0          2d
> > zookeeper-3-46n09          1/1     Running   0          2d
> >
> > Finally, accessing services within the Kubernetes cluster from the
> > outside is quite cumbersome unless one uses an external load balancer.
> > This makes it difficult to bootstrap a job, as SamzaJob must connect
> > to Zookeeper and Kafka to find out the number of partitions on the
> > topics it will subscribe to, so it can assign them statically among
> > the number of containers requested.
> >
> > Ideally Samza would operate along the lines of the Kafka high-level
> > consumer, which dynamically coordinates to allocate work among the
> > members of a consumer group. This would do away with the need to
> > execute SamzaJob a priori to generate the JobModel to pass to the
> > SamzaContainers. It would also allow for dynamically changing the
> > number of containers without having to shut down the job.
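
To make the static assignment point in the quoted message concrete, here is
a toy sketch of dividing a topic's partitions among a fixed number of
containers. It uses plain round-robin, which is an assumption chosen for
illustration and not necessarily how Samza groups tasks; assign_partitions
is a hypothetical helper, not a Samza API.

```python
def assign_partitions(num_partitions, num_containers):
    """Round-robin partitions over containers; returns {container_id: [partitions]}."""
    if num_containers < 1:
        raise ValueError("need at least one container")
    assignment = {c: [] for c in range(num_containers)}
    for p in range(num_partitions):
        assignment[p % num_containers].append(p)
    return assignment


# The partition count has to be known up front, which is why the job
# must reach Kafka and Zookeeper before any container starts.
print(assign_partitions(5, 3))  # {0: [0, 3], 1: [1, 4], 2: [2]}

# Changing the container count produces a different mapping, so the
# model has to be regenerated and the containers restarted.
print(assign_partitions(5, 2))  # {0: [0, 2, 4], 1: [1, 3]}
```

Because the mapping is computed once and baked into the model, any change to
the container count means recomputing it, which is the limitation that
dynamic coordination would remove.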