Hi, sorry, what do you mean by 'container'? I use bare EC2 instances...
Shlomi

On Wed, Sep 10, 2014 at 1:41 AM, Jun Rao <jun...@gmail.com> wrote:

> Are you starting the broker in some container? You want to make sure that
> the container doesn't override the open file handler limit.
>
> Thanks,
>
> Jun
>
> On Tue, Sep 9, 2014 at 12:05 AM, Shlomi Hazan <shl...@viber.com> wrote:
>
> > Hi,
> > it's probably beyond that. It may be an issue with the number of files
> > Kafka can have open concurrently.
> > A previous conversation with Joe (subject: "build failes for latest stable
> > source tgz (kafka_2.9.2-0.8.1.1)") ended up discussing this (Q's by Joe,
> > A's by me):
> >
> > 1. what else on the logs? [*see below*]
> > 2. other broker failure reason? [*"*]
> > 3. other broker failure after taking leadership? [*how can I be sure? ask
> > another to describe topic?*]
> > 4. how do I measure number of connections? [*ls -l /proc/<pid>/fd | grep
> > socket | wc -l, also did watch on that*]
> > 5. is that number equals the number of {new Producer}? [*yes*]
> > 6. how many topics? [*1*] how many partitions [*504*]
> > 7. Are u using a partition key? [*yes, I use the python client with*]
> >
> > class ProducerIdPartitioner(Partitioner):
> >     """
> >     Implements a partitioner which selects the target partition based on
> >     the sending producer ID
> >     """
> >     def partition(self, key, partitions):
> >         size = len(partitions)
> >         prod_id = int(key)
> >         idx = prod_id % size
> >         return partitions[idx]
> >
> > 8. maybe running into over partitioned topic? [*producer instances is 6
> > machines * 84 procs * 24 threads, but never got to start them all, b/c of
> > errors*]
> > 9. r u running anything else? [*yes, zookeeper*]
> >
> >
> > answer to 1,2:
> > the errors I see on the python client are first timeouts and then message
> > send failures, using sync send.
> >
> > on the controller log:
> >
> > controller.log.2014-08-26-13:[2014-08-26 13:40:44,317] ERROR
> > [Controller-1-to-broker-3-send-thread], Controller 1 epoch 3 failed to send
> > StopReplica request with correlation id 519 to broker
> > id:3,host:shlomi-kafka-broker-3,port:9092. Reconnecting to broker.
> > (kafka.controller.RequestSendThread)
> > controller.log.2014-08-26-13:[2014-08-26 13:40:44,319] ERROR
> > [Controller-1-to-broker-3-send-thread], Controller 1's connection to broker
> > id:3,host:shlomi-kafka-broker-3,port:9092 was unsuccessful
> > (kafka.controller.RequestSendThread)
> >
> > on the server log (selected greps):
> > ...
> > server.log.2014-08-27-01:[2014-08-27 01:44:23,143] ERROR
> > [ReplicaFetcherThread-4-2], Error for partition [vpq_android_gcm_h,270] to
> > broker 2:class kafka.common.NotLeaderForPartitionException
> > (kafka.server.ReplicaFetcherThread)
> > ...
> > server.log.2014-08-27-12:[2014-08-27 12:08:34,638] ERROR Closing socket for
> > /10.184.150.54 because of error (kafka.network.Processor)
> >
> > ...
> > server.log.2014-08-28-07:[2014-08-28 07:57:35,944] ERROR [KafkaApi-1] Error
> > when processing fetch request for partition [vpq_android_gcm_h,184] offset
> > 8798 from follower with correlation id 0 (kafka.server.KafkaApis)
> > ...
> > server.log.2014-09-03-15:[2014-09-03 15:46:18,220] ERROR
> > [ReplicaFetcherThread-2-3], Error in fetch Name: FetchRequest; Version: 0;
> > CorrelationId: 177593; ClientId: ReplicaFetcherThread-2-3; ReplicaId: 1;
> > MaxWait: 1000 ms; MinBytes: 1 bytes; RequestInfo: [vpq_android_gcm_h,196]
> > -> PartitionFetchInfo(65283,8388608),[vpq_android_gcm_h,76] ->
> > PartitionFetchInfo(262787,8388608),[vpq_android_gcm_h,460] ->
> > PartitionFetchInfo(285709,8388608),[vpq_android_gcm_h,100] ->
> > PartitionFetchInfo(199405,8388608),[vpq_android_gcm_h,148] ->
> > PartitionFetchInfo(339032,8388608),[vpq_android_gcm_h,436] ->
> > PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,124] ->
> > PartitionFetchInfo(484447,8388608),[vpq_android_gcm_h,484] ->
> > PartitionFetchInfo(105945,8388608),[vpq_android_gcm_h,340] ->
> > PartitionFetchInfo(0,8388608),[vpq_android_gcm_h,388] ->
> > PartitionFetchInfo(9,8388608),[vpq_android_gcm_h,316] ->
> > PartitionFetchInfo(194766,8388608),[vpq_android_gcm_h,364] ->
> > PartitionFetchInfo(139897,8388608),[vpq_android_gcm_h,292] ->
> > PartitionFetchInfo(195408,8388608),[vpq_android_gcm_h,28] ->
> > PartitionFetchInfo(329961,8388608),[vpq_android_gcm_h,172] ->
> > PartitionFetchInfo(436959,8388608),[vpq_android_gcm_h,268] ->
> > PartitionFetchInfo(59827,8388608),[vpq_android_gcm_h,244] ->
> > PartitionFetchInfo(259731,8388608),[vpq_android_gcm_h,220] ->
> > PartitionFetchInfo(61669,8388608),[vpq_android_gcm_h,412] ->
> > PartitionFetchInfo(563609,8388608),[vpq_android_gcm_h,4] ->
> > PartitionFetchInfo(360336,8388608),[vpq_android_gcm_h,52] ->
> > PartitionFetchInfo(378533,8388608) (kafka.server.ReplicaFetcherThread)
> > ...
> > server.log.2014-09-03-14:[2014-09-03 14:04:18,548] ERROR Error in acceptor
> > (kafka.network.Acceptor)
> > ...
> >
> >
> > and these may not be all (other logs may have some more of that)....
> >
> >
> > Joe said to just lower the number of connections, but I still can't see the
> > exact problem.
> > Is there a Kafka limit on the number of concurrently open files? I ask
> > because the process itself was not limited...
> >
> > Thanks,
> > Shlomi
> >
> > On Tue, Sep 9, 2014 at 7:12 AM, Jun Rao <jun...@gmail.com> wrote:
> >
> > > What type of error did you see? You may need to configure a larger open
> > > file handler limit.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Sep 3, 2014 at 12:01 PM, Shlomi Hazan <hzshl...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to load a cluster with more than 10K connections, and bumped
> > > > into the error in the subject.
> > > > Is there any limitation on Kafka's side? If so, is it configurable? How?
> > > > At first look, it seems the selector accepting the connections is
> > > > overflowing...
> > > >
> > > > Thanks.
> > > > --
> > > > Shlomi
> > > >
> > >
> >
>
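
For reference, the check from answer 4 in the quoted message (counting socket entries under /proc/<pid>/fd) can be set against the open-file limit Jun asks about. The script below is only a sketch, assuming Linux with /proc mounted. The helper names count_fds and nofile_limits are made up for illustration and are not part of Kafka or any client library.

```python
import os
import sys


def count_fds(pid):
    """Return (total_fds, socket_fds) for a process by reading /proc/<pid>/fd."""
    fd_dir = "/proc/%d/fd" % pid
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        total += 1
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Socket descriptors show up as links of the form "socket:[inode]".
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets


def nofile_limits(pid):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits.

    Values are returned as strings because a limit may read 'unlimited'.
    """
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                fields = line.split()
                return fields[3], fields[4]
    raise RuntimeError("no 'Max open files' line found")


if __name__ == "__main__":
    # Pass the broker's pid as the first argument; defaults to this process.
    # Reading another user's process usually requires root.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    total, sockets = count_fds(pid)
    soft, hard = nofile_limits(pid)
    print("pid %d: %d open fds (%d sockets), nofile soft=%s hard=%s"
          % (pid, total, sockets, soft, hard))
```

Reading the limit from /proc/<pid>/limits shows what the running process actually got, which can differ from what `ulimit -n` reports in an interactive shell.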

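
The numbers in answers 6 to 8 of the quoted message can be worked through to see how the modulo partitioner would spread producers over partitions. This is a sketch only: partition_index is a hypothetical standalone version of the partition method shown in the thread (without the client's Partitioner base class), and it assumes producer IDs are consecutive integers starting at 0, which the thread does not state.

```python
from collections import Counter


def partition_index(key, num_partitions):
    """Same rule as the partitioner in the thread: producer ID modulo partition count."""
    return int(key) % num_partitions


# Figures taken from the thread.
machines, procs, threads = 6, 84, 24
num_partitions = 504

# Planned number of producers, one per thread.
producers = machines * procs * threads

# Assumption: producer IDs are 0 .. producers - 1, passed as string keys.
per_partition = Counter(
    partition_index(str(prod_id), num_partitions) for prod_id in range(producers)
)

print("planned producers:", producers)
print("producers per partition (min, max):",
      min(per_partition.values()), max(per_partition.values()))
```

Under that assumption the planned 12096 producers divide evenly, 24 per partition. How many broker connections each producer opens is a separate question that this arithmetic does not answer.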