We've set up jmxtrans and use it to check these two values:

  UncleanLeaderElectionsPerSec
  UnderReplicatedPartitions

Here is our shinken/nagios configuration:

define command {
    command_name    check_kafka_underreplicated
    command_line    $USER1$/check_jmx -U service:jmx:rmi:///jndi/rmi://$HOSTADDRESS$:9999/jmxrmi -O "kafka.server":type="ReplicaManager",name="UnderReplicatedPartitions" -A Value -w $ARG1$ -c $ARG2$
}

define command {
    command_name    check_kafka_uncleanleader
    command_line    $USER1$/check_jmx -U service:jmx:rmi:///jndi/rmi://$HOSTADDRESS$:9999/jmxrmi -O "kafka.controller":type="ControllerStats",name="UncleanLeaderElectionsPerSec" -A Count -w $ARG1$ -c $ARG2$
}

define service {
    hostgroup_name          KafkaBroker
    use                     generic-service
    service_description     Kafka Unclean Leader Elections per sec
    check_command           check_kafka_uncleanleader!1!10
    check_interval          15
    retry_interval          5
}

define service {
    hostgroup_name          KafkaBroker
    use                     generic-service
    service_description     Kafka Under Replicated Partitions
    check_command           check_kafka_underreplicated!1!10
    check_interval          15
    retry_interval          5
}

-- 
Elias Abacioglu
Infrastructure Specialist at Delta Projects AB

On Mon, Feb 29, 2016 at 12:41 PM, tao xiao <xiaotao...@gmail.com> wrote:

> Thanks Jens. What I want to achieve is to check that every broker within a
> cluster functions properly.
> The way you suggest can identify the liveness of a cluster, but it
> doesn't necessarily mean every broker in the cluster is alive. To
> achieve that, I can either create a topic with as many partitions as
> there are brokers and min.insync.replicas=number of brokers, or create
> one topic per broker and then send a ping message to each broker. But
> this approach is definitely not scalable as we expand the cluster.
> Therefore I am looking for another way to achieve this.
>
> On Mon, 29 Feb 2016 at 16:54 Jens Rantil <jens.ran...@tink.se> wrote:
>
> > Hi,
> >
> > I assume you first want to ask yourself what liveness you would like to
> > check for. I guess the most realistic check is to put a "ping" message on
> > the broker and make sure that you can consume it.
> >
> > Cheers,
> > Jens
> >
> > On Fri, Feb 26, 2016 at 12:38 PM, tao xiao <xiaotao...@gmail.com> wrote:
> >
> > > Hi team,
> > >
> > > What is the best way to verify that a specific Kafka node functions
> > > properly? Telnetting to the port is one approach, but I don't think it
> > > tells me whether or not the broker can still receive/send traffic. I am
> > > thinking of asking the broker for metadata using
> > > consumer.partitionsFor. If it can return PartitionInfo, the broker is
> > > considered live. Is this a good approach?
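
For reference, here is a rough Python sketch of the threshold logic behind the two checks in the configuration above, evaluated once per broker so that each broker gets its own state rather than one state for the whole cluster. This is only an illustration: the function names (classify, check_brokers, worst_state) and the sample hostnames and values are made up, and it assumes that a value at or above a threshold triggers that state. The exact comparison check_jmx uses may differ.

```python
# Hypothetical sketch of the -w / -c thresholds used by the checks above.
# Assumption: a value at or above a threshold triggers that state.

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def classify(value, warn, crit):
    """Map a single metric value to a Nagios exit code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_brokers(metrics, warn=1, crit=10):
    """Classify every broker separately; metrics maps hostname -> value."""
    return {host: classify(value, warn, crit) for host, value in metrics.items()}


def worst_state(states):
    """Return the most severe state across all brokers (OK if there are none)."""
    return max(states.values(), default=OK)


if __name__ == "__main__":
    # Made-up UnderReplicatedPartitions readings, one per broker.
    sample = {"kafka1": 0, "kafka2": 3, "kafka3": 12}
    states = check_brokers(sample)
    print(states)
    print("worst state:", worst_state(states))
```

The defaults warn=1 and crit=10 match the !1!10 arguments in the service definitions. Because the services are attached to the KafkaBroker hostgroup, adding a broker to that hostgroup is enough to have it checked.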