I'll chime in to say I am running the standalone mode successfully in
Kubernetes. The ZK coordinator is very useful in this context, as you can
partition a topic for the max *desired* parallelism without continually
running that many nodes. You could also use the "operator" pattern in
Kubernetes to create a native coordinator for that environment, but I
haven't been particularly motivated to do so.
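
To illustrate the point about partitioning for desired parallelism, here
is a minimal sketch of how a fixed set of tasks (one per partition) can be
spread over however many nodes happen to be running. The function name,
the pod names, and the round-robin policy are assumptions made for
illustration only; this is not Samza's actual task grouper.

```python
def assign_tasks(num_partitions, processors):
    """Spread one task per partition across the live processors, round-robin.

    Illustrative only: the real assignment is done by the job coordinator
    and its configured grouper, not by this function.
    """
    if not processors:
        raise ValueError("at least one processor is required")
    assignment = {p: [] for p in processors}
    for task_id in range(num_partitions):
        owner = processors[task_id % len(processors)]
        assignment[owner].append(task_id)
    return assignment


# Same 8-partition topic, two different cluster sizes (hypothetical pod names).
few_nodes = assign_tasks(8, ["pod-a", "pod-b"])
many_nodes = assign_tasks(8, ["pod-%d" % i for i in range(8)])
```

With two nodes each one carries four tasks; with eight nodes each one
carries a single task. The partition count sets the ceiling, and the
number of running nodes can stay well below it.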

You really do get the abstractions of Samza along with the coordinators
and the scheduling that determines how tasks are grouped and so on. If you
don't want the overhead of those things, you're probably better off just
using Kafka Streams directly.

The minimum number of partitions you can "get away with" is one. To
calculate a floor beyond that would require a lot more knowledge about
your throughput, cluster volatility (more partitions + more nodes coming
and going will increase bookkeeping overhead), retention goals, etc.


Jagadish Venkatraman <jagadish1...@gmail.com> writes:

The standalone mode was introduced for this exact reason for customers who
don’t want to run YARN.

Have you considered running Samza in stand-alone mode? In this mode, Samza
is an embedded library - very similar to Kafka Streams.

https://samza.apache.org/learn/documentation/latest/deployment/standalone.html

A good rule of thumb when deciding the number of partitions (N) is: “how
much data do you want to retain per partition at any time?” You can pick N
such that you retain around 20 GB per partition. Another factor to
consider is whether you are getting adequate compute parallelism.
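
As a rough sketch of that rule of thumb, the partition count can be
estimated by dividing the total data you expect to retain by the
per-partition target. The function name and the 150 GB figure below are
made up for illustration, and the sketch ignores the compute-parallelism
factor entirely.

```python
import math


def partitions_for_retention(total_retained_gb, target_gb_per_partition=20.0):
    """Estimate N so that each partition retains roughly the target size.

    The 20 GB default follows the rule of thumb above; it is a starting
    point, not a hard limit.
    """
    if total_retained_gb < 0 or target_gb_per_partition <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Round up, and never go below one partition.
    return max(1, math.ceil(total_retained_gb / target_gb_per_partition))


# Hypothetical topic retaining about 150 GB in total.
estimate = partitions_for_retention(150)
small_topic = partitions_for_retention(5)
```

Here 150 GB at 20 GB per partition rounds up to 8 partitions, while a
5 GB topic stays at the minimum of one.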


On Tuesday, February 19, 2019, Jeremiah Adams <jad...@helixeducation.com>
wrote:

We are finding YARN and AWS EC2 to be too costly for us. We are having to
scale the cluster to support more jobs, and we have plans to write more
jobs. We are scaling because the cluster doesn’t have enough VCores to
support all the containers, not enough RAM for jobs, etc.

Has anyone had luck running Samza jobs in an alternative scheduler? Say,
Nomad, Kubernetes or something else?

Similarly, has anyone had any luck with Samza on something like Kafka
Streams, where I don’t have to have the overhead of YARN and a scheduler
at all?

Also, at a small-scale shop, what is the minimum number of partitions I
can get away with? Any advice on determining the appropriate number of
partitions? Kafka, ZooKeeper, and Secor are also costs we could
potentially reduce via partition count.


Thanks for any input.



Jeremiah Adams
Software Engineer
www.helixeducation.com
Blog: http://www.helixeducation.com/blog/
Twitter: https://twitter.com/HelixEducation
Facebook: https://www.facebook.com/HelixEducation
LinkedIn: http://www.linkedin.com/company/3609946

