I really recommend the book "Kafka: The Definitive Guide" - it's very useful for people running clusters, with lots of good advice on tuning, metrics, etc.

Basically, you scale your cluster when you're hitting the limits of the resources that matter most to Kafka on the broker nodes - CPU, network or disk - as identified by various OS and Kafka metrics. One metric recommended in the book I mentioned above is RequestHandlerAvgIdlePercent: the book states that if it drops below 20% it indicates potential problems, and below 10% definite performance problems.
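If you want to eyeball that metric without standing up a full monitoring stack, something like the sketch below should work - it polls the broker's MBean over JMX. Note this is just an illustration under some assumptions: the localhost:9999 endpoint assumes the broker was started with JMX_PORT=9999 (adjust for your environment), and the class name is mine.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HandlerIdleCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on localhost:9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,"
                    + "name=RequestHandlerAvgIdlePercent");
            // Despite the "Percent" in the name, the meter's rate is a
            // fraction between 0 and 1, so 0.20 is the 20% threshold.
            double idle = ((Number) conn.getAttribute(
                    name, "OneMinuteRate")).doubleValue();
            if (idle < 0.10) {
                System.out.printf("idle=%.1f%% - definite performance problems%n", idle * 100);
            } else if (idle < 0.20) {
                System.out.printf("idle=%.1f%% - potential problems%n", idle * 100);
            } else {
                System.out.printf("idle=%.1f%% - healthy%n", idle * 100);
            }
        }
    }
}

In practice you'd graph this over time in whatever monitoring system you use rather than spot-checking it, since a single sample can catch a transient spike.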
In terms of whether to scale horizontally or vertically, that really depends on the costs involved in either option. Although if you're saturating the network interface on a node, you can't really scale that one vertically.

Kind regards,

Liam Clarke

On 27 Dec. 2018 1:24 pm, "Harper Henn" <harper.h...@datto.com> wrote:

> Hi,
>
> Many articles exist about running Kafka at scale, but there are fewer
> resources for learning when to grow your cluster (e.g. adding a new
> broker or upgrading the machine it's running on). At first, the answer
> seems straightforward: you add a broker to reduce the network I/O, CPU
> utilization, etc. that each broker experiences. But when and how do you
> know a broker is taxed too heavily and it's time to add a new one? Any
> thoughts about scaling by adding brokers vs. scaling with more powerful
> hardware?
>
> Best,
> Harper