Hi,
sorry, what do you mean by 'container'? I use bare EC2 instances...

Shlomi

On Wed, Sep 10, 2014 at 1:41 AM, Jun Rao <jun...@gmail.com> wrote:

> Are you starting the broker in some container? You want to make sure that
> the container doesn't overwrite the open file handler limit.
>
> Thanks,
>
> Jun
>
> On Tue, Sep 9, 2014 at 12:05 AM, Shlomi Hazan <shl...@viber.com> wrote:
>
> > Hi,
> > it's probably beyond that. It may be an issue with the number of files
> > Kafka can have open concurrently.
> > A previous conversation with Joe about (build failes for latest stable
> > source tgz (kafka_2.9.2-0.8.1.1)) turned out to discuss this (Q's by Joe,
> > A's by me):
> >
> > 1. what else on the logs? [*see below*]
> > 2. other broker failure reason? [*"*]
> > 3. other broker failure after taking leadership? [*how can I be sure? ask
> >    another to describe topic?*]
> > 4. how do I measure number of connections? [*ls -l /proc/<pid>/fd | grep
> >    socket | wc -l, also did watch on that*]
> > 5. is that number equals the number of {new Producer}? [*yes*]
> > 6. how many topics? [*1*] how many partitions [*504*]
> > 7. Are u using a partition key? [*yes, I use the python client with*]
> >
> >    class ProducerIdPartitioner(Partitioner):
> >        """
> >        Implements a partitioner which selects the target partition
> >        based on the sending producer ID
> >        """
> >        def partition(self, key, partitions):
> >            size = len(partitions)
> >            prod_id = int(key)
> >            idx = prod_id % size
> >            return partitions[idx]
> >
> > 8. maybe running into over partitioned topic? [*producer instances is 6
> >    machines * 84 procs * 24 threads, but never got to start them all, b/c
> >    of errors*]
> > 9. r u running anything else? [*yes, zookeeper*]
> >
> > answer to 1,2:
> > the errors I see on the python client are first timeouts and then message
> > send failures, using sync send.
> >
> > on the controller log:
> >
> > controller.log.2014-08-26-13:[2014-08-26 13:40:44,317] ERROR
> > [Controller-1-to-broker-3-send-thread], Controller 1 epoch 3 failed to
> > send StopReplica request with correlation id 519 to broker
> > id:3,host:shlomi-kafka-broker-3,port:9092. Reconnecting to broker.
> > (kafka.controller.RequestSendThread)
> > controller.log.2014-08-26-13:[2014-08-26 13:40:44,319] ERROR
> > [Controller-1-to-broker-3-send-thread], Controller 1's connection to
> > broker id:3,host:shlomi-kafka-broker-3,port:9092 was unsuccessful
> > (kafka.controller.RequestSendThread)
> >
> > on the server log (selected greps):
> > ...
> > server.log.2014-08-27-01:[2014-08-27 01:44:23,143] ERROR
> > [ReplicaFetcherThread-4-2], Error for partition [vpq_android_gcm_h,270]
> > to broker 2:class kafka.common.NotLeaderForPartitionException
> > (kafka.server.ReplicaFetcherThread)
> > ...
> > server.log.2014-08-27-12:[2014-08-27 12:08:34,638] ERROR Closing socket
> > for /10.184.150.54 because of error (kafka.network.Processor)
> >
> > ...
> > server.log.2014-08-28-07:[2014-08-28 07:57:35,944] ERROR [KafkaApi-1]
> > Error when processing fetch request for partition
> > [vpq_android_gcm_h,184] offset 8798 from follower with correlation id 0
> > (kafka.server.KafkaApis)
> > ...
> > server.log.2014-09-03-15:[2014-09-03 15:46:18,220] ERROR
> > [ReplicaFetcherThread-2-3], Error in fetch Name: FetchRequest; Version: 0;
> > CorrelationId: 177593; ClientId: ReplicaFetcherThread-2-3; ReplicaId: 1;
> > MaxWait: 1000 ms; MinBytes: 1 bytes; RequestInfo:
> > [vpq_android_gcm_h,196] -> PartitionFetchInfo(65283,8388608),
> > [vpq_android_gcm_h,76] -> PartitionFetchInfo(262787,8388608),
> > [vpq_android_gcm_h,460] -> PartitionFetchInfo(285709,8388608),
> > [vpq_android_gcm_h,100] -> PartitionFetchInfo(199405,8388608),
> > [vpq_android_gcm_h,148] -> PartitionFetchInfo(339032,8388608),
> > [vpq_android_gcm_h,436] -> PartitionFetchInfo(0,8388608),
> > [vpq_android_gcm_h,124] -> PartitionFetchInfo(484447,8388608),
> > [vpq_android_gcm_h,484] -> PartitionFetchInfo(105945,8388608),
> > [vpq_android_gcm_h,340] -> PartitionFetchInfo(0,8388608),
> > [vpq_android_gcm_h,388] -> PartitionFetchInfo(9,8388608),
> > [vpq_android_gcm_h,316] -> PartitionFetchInfo(194766,8388608),
> > [vpq_android_gcm_h,364] -> PartitionFetchInfo(139897,8388608),
> > [vpq_android_gcm_h,292] -> PartitionFetchInfo(195408,8388608),
> > [vpq_android_gcm_h,28] -> PartitionFetchInfo(329961,8388608),
> > [vpq_android_gcm_h,172] -> PartitionFetchInfo(436959,8388608),
> > [vpq_android_gcm_h,268] -> PartitionFetchInfo(59827,8388608),
> > [vpq_android_gcm_h,244] -> PartitionFetchInfo(259731,8388608),
> > [vpq_android_gcm_h,220] -> PartitionFetchInfo(61669,8388608),
> > [vpq_android_gcm_h,412] -> PartitionFetchInfo(563609,8388608),
> > [vpq_android_gcm_h,4] -> PartitionFetchInfo(360336,8388608),
> > [vpq_android_gcm_h,52] -> PartitionFetchInfo(378533,8388608)
> > (kafka.server.ReplicaFetcherThread)
> > ...
> > server.log.2014-09-03-14:[2014-09-03 14:04:18,548] ERROR Error in
> > acceptor (kafka.network.Acceptor)
> > ...
> >
> > and these may not be all (other logs may have some more of that)....
> >
> > Joe said to just lower the number of connections but I still can't see
> > the exact problem.
> > Is there a Kafka limit on the number of concurrently open files? I ask
> > because the process was not limited...
> >
> > Thanks,
> > Shlomi
> >
> > On Tue, Sep 9, 2014 at 7:12 AM, Jun Rao <jun...@gmail.com> wrote:
> >
> > > What type of error did you see? You may need to configure a larger open
> > > file handler limit.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Sep 3, 2014 at 12:01 PM, Shlomi Hazan <hzshl...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to load a cluster with more than 10K connections, and
> > > > bumped into the error in the subject.
> > > > Is there any limitation on Kafka's side? If so, is it configurable?
> > > > How?
> > > > On first look, it looks like the selector accepting the connection is
> > > > overflowing...
> > > >
> > > > Thanks.
> > > > --
> > > > Shlomi
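
For anyone following this thread: two things discussed above can be checked on the broker host itself. One is the socket count from answer 4 (ls -l /proc/<pid>/fd | grep socket | wc -l). The other is the open-file limit that actually applies to the running broker process, which can differ from what the login shell reports. The following is only a rough Python sketch, assuming a Linux host with /proc mounted. The function names are made up for illustration and are not part of Kafka or of any Kafka client.

```python
import os


def is_socket_link(target):
    """True if an fd symlink target (e.g. 'socket:[12345]') refers to a socket."""
    return target.startswith("socket:")


def count_socket_fds(pid):
    """Count open socket descriptors of a process via /proc/<pid>/fd (Linux only).

    Reading another user's process normally requires root.
    """
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if is_socket_link(target):
            count += 1
    return count


def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because either may be 'unlimited'.
    Returns None if the row is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            return fields[3], fields[4]
    return None


def read_max_open_files(pid):
    """Read the limits that apply to the running process, not to the login shell."""
    with open("/proc/%d/limits" % pid) as f:
        return parse_max_open_files(f.read())
```

With a hypothetical broker_pid, comparing count_socket_fds(broker_pid) against the soft value from read_max_open_files(broker_pid) while the producers ramp up would show whether the broker is approaching its descriptor limit around the time the acceptor errors appear. Note that log segments and index files also count against the same limit, so the socket count alone understates usage.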