Hi, it's probably beyond that; it may be an issue with the number of files Kafka can have open concurrently.
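By "beyond that" I mean the OS descriptor limit itself does not look exhausted. The check behind that (and behind the "process was not limited" remark at the end) is roughly the following sketch; it assumes a Linux broker host, reads /proc directly, and BROKER_PID is just a placeholder for the broker JVM's pid:

    import os

    def open_fd_count(pid):
        # one entry in /proc/<pid>/fd per open descriptor
        # (log segments, index files, sockets, ...)
        return len(os.listdir("/proc/%d/fd" % pid))

    def nofile_soft_limit(pid):
        # the "Max open files" row of /proc/<pid>/limits;
        # the soft limit is the 4th whitespace-separated field
        with open("/proc/%d/limits" % pid) as limits:
            for line in limits:
                if line.startswith("Max open files"):
                    return line.split()[3]  # a number, or "unlimited"

    BROKER_PID = 12345  # placeholder: substitute the broker JVM's real pid
    print("fds in use:", open_fd_count(BROKER_PID))
    print("soft limit:", nofile_soft_limit(BROKER_PID))

If the in-use count were anywhere near the soft limit, raising the open file handler limit as you suggest would indeed be the first thing to do.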
A previous conversation with Joe (about "build failes for latest stable source tgz (kafka_2.9.2-0.8.1.1)") turned out to discuss this (Q's by Joe, A's by me):

1. What else is in the logs? [see below]
2. Other broker failure reason? [see below]
3. Other broker failure after taking leadership? [how can I be sure? ask another broker to describe the topic?]
4. How do I measure the number of connections? [ls -l /proc/<pid>/fd | grep socket | wc -l, and I also ran watch on that (see the sketch after this list)]
5. Is that number equal to the number of {new Producer}? [yes]
6. How many topics? [1] How many partitions? [504]
7. Are you using a partition key? [yes, I use the python client with:]

        # Partitioner is the kafka-python client's partitioner base class
        # (the exact import path may differ between kafka-python versions)
        from kafka.partitioner import Partitioner

        class ProducerIdPartitioner(Partitioner):
            """
            Implements a partitioner which selects the target partition
            based on the sending producer ID
            """
            def partition(self, key, partitions):
                size = len(partitions)
                prod_id = int(key)
                idx = prod_id % size
                return partitions[idx]

8. Maybe you are running into an over-partitioned topic? [the producer side is 6 machines * 84 procs * 24 threads, but I never got to start them all because of the errors (see the arithmetic after this list)]
9. Are you running anything else? [yes, ZooKeeper]
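Expanding on 4 and 8 above: a rough Python equivalent of the shell pipeline from 4 (what I actually run is just that ls/grep/wc under watch), plus the producer arithmetic from 8:

    import os

    # Rough equivalent of `ls -l /proc/<pid>/fd | grep socket | wc -l`:
    # count the descriptors in /proc/<pid>/fd whose link target is a socket.
    def socket_fd_count(pid):
        fd_dir = "/proc/%d/fd" % pid
        count = 0
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor went away between listdir() and readlink()
            if target.startswith("socket:"):
                count += 1
        return count

    # The intended producer-side load from 8: every thread is its own producer,
    # so each holds at least one connection to the cluster.
    machines = 6
    procs_per_machine = 84
    threads_per_proc = 24
    print(machines * procs_per_machine * threads_per_proc)  # 12096

Those 12,096 producer threads are where the "over 10K connections" figure in my original mail below comes from, and as said in 8 I never even got all of them started.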
Answers to 1 and 2: the errors I see on the Python client are first timeouts and then message-send failures (I am using sync send).

On the controller log:

controller.log.2014-08-26-13:[2014-08-26 13:40:44,317] ERROR [Controller-1-to-broker-3-send-thread], Controller 1 epoch 3 failed to send StopReplica request with correlation id 519 to broker id:3,host:shlomi-kafka-broker-3,port:9092. Reconnecting to broker. (kafka.controller.RequestSendThread)
controller.log.2014-08-26-13:[2014-08-26 13:40:44,319] ERROR [Controller-1-to-broker-3-send-thread], Controller 1's connection to broker id:3,host:shlomi-kafka-broker-3,port:9092 was unsuccessful (kafka.controller.RequestSendThread)

On the server log (selected greps):

...
server.log.2014-08-27-01:[2014-08-27 01:44:23,143] ERROR [ReplicaFetcherThread-4-2], Error for partition [vpq_android_gcm_h,270] to broker 2:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
...
server.log.2014-08-27-12:[2014-08-27 12:08:34,638] ERROR Closing socket for /10.184.150.54 because of error (kafka.network.Processor)
...
server.log.2014-08-28-07:[2014-08-28 07:57:35,944] ERROR [KafkaApi-1] Error when processing fetch request for partition [vpq_android_gcm_h,184] offset 8798 from follower with correlation id 0 (kafka.server.KafkaApis)
...
server.log.2014-09-03-15:[2014-09-03 15:46:18,220] ERROR [ReplicaFetcherThread-2-3], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 177593; ClientId: ReplicaFetcherThread-2-3; ReplicaId: 1; MaxWait: 1000 ms; MinBytes: 1 bytes; RequestInfo: [vpq_android_gcm_h,196] -> PartitionFetchInfo(65283,8388608),[vpq_android_gcm_h,76] -> PartitionFetchInfo(262787,8388608),[vpq_android_gcm_h,460] -> PartitionFetchInfo(285709,8388608),[vpq_android_gcm_h,100] -> PartitionFetchInfo(199405,8388608),[vpq_android_gcm_h,148] -> PartitionFetchInfo(339032,8388608),[vpq_android_gcm_h,436] -> PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,124] -> PartitionFetchInfo(484447,8388608),[vpq_android_gcm_h,484] -> PartitionFetchInfo(105945,8388608),[vpq_android_gcm_h,340] -> PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,388] -> PartitionFetchInfo(9,8388608),[vpq_android_gcm_h,316] -> PartitionFetchInfo(194766,8388608),[vpq_android_gcm_h,364] -> PartitionFetchInfo(139897,8388608),[vpq_android_gcm_h,292] -> PartitionFetchInfo(195408,8388608),[vpq_android_gcm_h,28] -> PartitionFetchInfo(329961,8388608),[vpq_android_gcm_h,172] -> PartitionFetchInfo(436959,8388608),[vpq_android_gcm_h,268] -> PartitionFetchInfo(59827,8388608),[vpq_android_gcm_h,244] -> PartitionFetchInfo(259731,8388608),[vpq_android_gcm_h,220] -> PartitionFetchInfo(61669,8388608),[vpq_android_gcm_h,412] -> PartitionFetchInfo(563609,8388608),[vpq_android_gcm_h,4] -> PartitionFetchInfo(360336,8388608),[vpq_android_gcm_h,52] -> PartitionFetchInfo(378533,8388608) (kafka.server.ReplicaFetcherThread)
...
server.log.2014-09-03-14:[2014-09-03 14:04:18,548] ERROR Error in acceptor (kafka.network.Acceptor)
...

And these may not be all; other logs may have more of the same.

Joe said to just lower the number of connections, but I still can't see the exact problem. Is there a Kafka-side limit on the number of concurrently open files? The process itself was not limited...

Thanks,
Shlomi

On Tue, Sep 9, 2014 at 7:12 AM, Jun Rao <jun...@gmail.com> wrote:

> What type of error did you see? You may need to configure a larger open
> file handler limit.
>
> Thanks,
>
> Jun
>
> On Wed, Sep 3, 2014 at 12:01 PM, Shlomi Hazan <hzshl...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to load a cluster with over than 10K connections, and bumped
> > into the error in the subject.
> > Is there any limitation on Kafka's side? if so it configurable? how?
> > on first look, it looks like the selector accepting the connection is
> > overflowing...
> >
> > Thanks.
> > --
> > Shlomi
> >