Hi,
It's probably beyond that. It may be an issue with the number of files
Kafka can have open concurrently.
A previous conversation with Joe (about "build fails for latest stable
source tgz (kafka_2.9.2-0.8.1.1)") turned out to discuss this (Q's by Joe,
A's by me):

1. What else is in the logs? [*see below*]
2. Other broker failure reason? [*see below*]
3. Other broker failure after taking leadership? [*how can I be sure? ask
another to describe topic?*]
4. How do I measure the number of connections? [*ls -l /proc/<pid>/fd | grep
socket | wc -l, also ran watch on that*]
5. Does that number equal the number of {new Producer}? [*yes*]
6. How many topics? [*1*] How many partitions? [*504*]
7. Are you using a partition key? [*yes, I use the python client with*]

    from kafka.partitioner import Partitioner  # kafka-python

    class ProducerIdPartitioner(Partitioner):
        """
        Implements a partitioner which selects the target partition
        based on the sending producer ID
        """
        def partition(self, key, partitions):
            size = len(partitions)
            prod_id = int(key)  # the key is the producer ID
            idx = prod_id % size
            return partitions[idx]

8. Maybe running into an over-partitioned topic? [*producer instances are 6
machines * 84 procs * 24 threads (12,096 in total), but I never got to start
them all b/c of errors*]
9. Are you running anything else? [*yes, zookeeper*]
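
The connection count from answer 4 can also be taken from Python. The
following is a minimal sketch, assuming Linux (it reads /proc/<pid>/fd) and
enough permissions to read the broker's fd directory. count_fds is a
hypothetical helper name, not part of Kafka or the python client.

```python
import os


def count_fds(pid):
    """Return (total_fds, socket_fds) for a process, read from /proc/<pid>/fd."""
    fd_dir = "/proc/%s/fd" % pid
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        total += 1
        # Socket descriptors show up as links of the form "socket:[inode]".
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets


if __name__ == "__main__":
    # Substitute the broker's pid here; our own pid is used only so that
    # the sketch runs as-is.
    total, sockets = count_fds(os.getpid())
    print("open fds: %d, of which sockets: %d" % (total, sockets))
```

Counting all descriptors as well as sockets matters here because log segment
and index files count against the same per-process limit as connections.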


Answer to 1, 2:
The errors I see on the python client are first timeouts and then message
send failures, using sync send.

On the controller log:

controller.log.2014-08-26-13:[2014-08-26 13:40:44,317] ERROR
[Controller-1-to-broker-3-send-thread], Controller 1 epoch 3 failed to send
StopReplica request with correlation id 519 to broker
id:3,host:shlomi-kafka-broker-3,port:9092. Reconnecting to broker.
(kafka.controller.RequestSendThread)
controller.log.2014-08-26-13:[2014-08-26 13:40:44,319] ERROR
[Controller-1-to-broker-3-send-thread], Controller 1's connection to broker
id:3,host:shlomi-kafka-broker-3,port:9092 was unsuccessful
(kafka.controller.RequestSendThread)

On the server log (selected greps):
...
server.log.2014-08-27-01:[2014-08-27 01:44:23,143] ERROR
[ReplicaFetcherThread-4-2], Error for partition [vpq_android_gcm_h,270] to
broker 2:class kafka.common.NotLeaderForPartitionException
(kafka.server.ReplicaFetcherThread)
...
server.log.2014-08-27-12:[2014-08-27 12:08:34,638] ERROR Closing socket for
/10.184.150.54 because of error (kafka.network.Processor)

...
server.log.2014-08-28-07:[2014-08-28 07:57:35,944] ERROR [KafkaApi-1] Error
when processing fetch request for partition [vpq_android_gcm_h,184] offset
8798 from follower with correlation id 0 (kafka.server.KafkaApis)
...
server.log.2014-09-03-15:[2014-09-03 15:46:18,220] ERROR
[ReplicaFetcherThread-2-3], Error in fetch Name: FetchRequest; Version: 0;
CorrelationId: 177593; ClientId: ReplicaFetcherThread-2-3; ReplicaId: 1;
MaxWait: 1000 ms; MinBytes: 1 bytes; RequestInfo: [vpq_android_gcm_h,196]
-> PartitionFetchInfo(65283,8388608),[vpq_android_gcm_h,76] ->
PartitionFetchInfo(262787,8388608),[vpq_android_gcm_h,460] ->
PartitionFetchInfo(285709,8388608),[vpq_android_gcm_h,100] ->
PartitionFetchInfo(199405,8388608),[vpq_android_gcm_h,148] ->
PartitionFetchInfo(339032,8388608),[vpq_android_gcm_h,436] ->
PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,124] ->
PartitionFetchInfo(484447,8388608),[vpq_android_gcm_h,484] ->
PartitionFetchInfo(105945,8388608),[vpq_android_gcm_h,340] ->
PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,388] ->
PartitionFetchInfo(9,8388608),[vpq_android_gcm_h,316] ->
PartitionFetchInfo(194766,8388608),[vpq_android_gcm_h,364] ->
PartitionFetchInfo(139897,8388608),[vpq_android_gcm_h,292] ->
PartitionFetchInfo(195408,8388608),[vpq_android_gcm_h,28] ->
PartitionFetchInfo(329961,8388608),[vpq_android_gcm_h,172] ->
PartitionFetchInfo(436959,8388608),[vpq_android_gcm_h,268] ->
PartitionFetchInfo(59827,8388608),[vpq_android_gcm_h,244] ->
PartitionFetchInfo(259731,8388608),[vpq_android_gcm_h,220] ->
PartitionFetchInfo(61669,8388608),[vpq_android_gcm_h,412] ->
PartitionFetchInfo(563609,8388608),[vpq_android_gcm_h,4] ->
PartitionFetchInfo(360336,8388608),[vpq_android_gcm_h,52] ->
PartitionFetchInfo(378533,8388608) (kafka.server.ReplicaFetcherThread)
...
server.log.2014-09-03-14:[2014-09-03 14:04:18,548] ERROR Error in acceptor
(kafka.network.Acceptor)
...


And these may not be all (other logs may have more of the same)...


Joe said to just lower the number of connections, but I still can't see the
exact problem.
Is there a Kafka limit on the number of concurrently open files? I ask because
the process itself was not limited...
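
One way to double-check which open-file limit is actually in effect for the
running broker (as opposed to the shell that launched it) is to read
/proc/<pid>/limits. This is a minimal sketch, assuming Linux;
parse_max_open_files and max_open_files are hypothetical helper names.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits.

    Both values are returned as strings because either may be 'unlimited'.
    Returns None if the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Row layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    return None


def max_open_files(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open("/proc/%s/limits" % pid) as f:
        return parse_max_open_files(f.read())


if __name__ == "__main__":
    # Substitute the broker's pid; our own pid is used only so that the
    # sketch runs as-is.
    print(max_open_files(os.getpid()))
```

If the soft limit reported there is lower than the number of sockets plus open
log files, failures to accept new connections (such as the "Error in acceptor"
entry above) would be a plausible symptom, though that is an assumption and
not something the logs above confirm.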

Thanks,
Shlomi

On Tue, Sep 9, 2014 at 7:12 AM, Jun Rao <jun...@gmail.com> wrote:

> What type of error did you see? You may need to configure a larger open
> file handle limit.
>
> Thanks,
>
> Jun
>
> On Wed, Sep 3, 2014 at 12:01 PM, Shlomi Hazan <hzshl...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to load a cluster with more than 10K connections, and bumped
> > into the error in the subject.
> > Is there any limitation on Kafka's side? If so, is it configurable? How?
> > At first look, it seems like the selector accepting the connection is
> > overflowing...
> >
> > Thanks.
> > --
> > Shlomi
> >
>
