Thanks for following up on this, Tony.
It's always super helpful to hear how things get resolved.
-D

On Fri, Jul 16, 2021 at 12:14 PM Tony John <tonyjohnant...@gmail.com> wrote:

> Hi All,
>
> An update on this: I finally figured out the cause. I have a
> consumer with *MAX_POLL_INTERVAL_MS_CONFIG* set to *Integer.MAX_VALUE*,
> which was causing the problem. It looks like a combination of
> *group.initial.rebalance.delay.ms* in Kafka and *max.poll.interval.ms*
> causes the *Rebalance failed.
> org.apache.kafka.common.errors.DisconnectException*. After debugging, I
> could see that the line below from the AbstractCoordinator class (line #337)
> leads to an integer overflow if *max.poll.interval.ms* is greater than
> (Integer.MAX_VALUE - 5000), and thus *joinGroupTimeoutMs* defaults to the
> request timeout. If the consumer's *request.timeout.ms* is then less than
> *group.initial.rebalance.delay.ms*, the issue occurs. Let me
> know what you think. For now I can get away with changing
> max.poll.interval.ms.
>
> AbstractCoordinator #337
> int joinGroupTimeoutMs = Math.max(this.client.defaultRequestTimeoutMs(),
> this.rebalanceConfig.rebalanceTimeoutMs + 5000);
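
The overflow described above can be reproduced in isolation. The following is a
minimal sketch, not Kafka source code: the class and method names are
hypothetical, the 30000 ms request timeout is an assumed value for
illustration, and the clamped variant is only one possible guard, not
necessarily how Kafka handles it.

```java
// Minimal sketch of the overflow; names and values are illustrative assumptions.
final class JoinGroupTimeoutDemo {

    // Mirrors the quoted expression: the int addition wraps around to a
    // negative number when rebalanceTimeoutMs > Integer.MAX_VALUE - 5000,
    // so Math.max falls back to the request timeout.
    static int naiveJoinGroupTimeoutMs(int requestTimeoutMs, int rebalanceTimeoutMs) {
        return Math.max(requestTimeoutMs, rebalanceTimeoutMs + 5000);
    }

    // One possible guard (an assumption): add in long arithmetic and clamp
    // the result to Integer.MAX_VALUE before comparing.
    static int clampedJoinGroupTimeoutMs(int requestTimeoutMs, int rebalanceTimeoutMs) {
        long padded = Math.min((long) rebalanceTimeoutMs + 5000L, Integer.MAX_VALUE);
        return Math.max(requestTimeoutMs, (int) padded);
    }

    public static void main(String[] args) {
        int requestTimeoutMs = 30000;               // assumed request timeout
        int maxPollIntervalMs = Integer.MAX_VALUE;  // the setting from the thread

        // The raw addition wraps around: prints -2147478649
        System.out.println(maxPollIntervalMs + 5000);
        // The naive timeout collapses to the request timeout: prints 30000
        System.out.println(naiveJoinGroupTimeoutMs(requestTimeoutMs, maxPollIntervalMs));
        // The clamped variant stays at the maximum: prints 2147483647
        System.out.println(clampedJoinGroupTimeoutMs(requestTimeoutMs, maxPollIntervalMs));
    }
}
```

With the naive version, the join request would time out after the request
timeout, which is shorter than a 120000 ms initial rebalance delay. Any
max.poll.interval.ms at or below Integer.MAX_VALUE - 5000 avoids the wraparound.
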
>
> Thanks,
> Tony
>
>
>
> On Wed, Jul 14, 2021 at 10:56 PM Tony John <tonyjohnant...@gmail.com>
> wrote:
>
> > Hi Shilin,
> >
> > Thanks for the suggestion. But I am not upgrading an existing cluster.
> > I've got a fresh broker and application cluster and there are no consumer
> > offsets or topics present. When the app starts it creates the topics and
> > once it moves to RUNNING state I see the rebalance failed log every 30
> > seconds. As I understand it, the steps in the doc need to be followed only
> > if an existing cluster is being migrated to the new version. Am I missing
> > something here? Below is the KafkaConfig from one of the brokers during
> > startup.
> >
> >
> > [2021-07-14 07:27:06,271] INFO KafkaConfig values:
> >         advertised.host.name = null
> >         advertised.listeners = PLAINTEXT://broker100:9092
> >         advertised.port = null
> >         alter.config.policy.class.name = null
> >         alter.log.dirs.replication.quota.window.num = 11
> >         alter.log.dirs.replication.quota.window.size.seconds = 1
> >         authorizer.class.name =
> >         auto.create.topics.enable = true
> >         auto.leader.rebalance.enable = true
> >         background.threads = 10
> >         broker.id = 100
> >         broker.id.generation.enable = true
> >         broker.rack = null
> >         client.quota.callback.class = null
> >         compression.type = producer
> >         connection.failed.authentication.delay.ms = 100
> >         connections.max.idle.ms = 1080000
> >         connections.max.reauth.ms = 0
> >         control.plane.listener.name = null
> >         controlled.shutdown.enable = true
> >         controlled.shutdown.max.retries = 3
> >         controlled.shutdown.retry.backoff.ms = 5000
> >         controller.quota.window.num = 11
> >         controller.quota.window.size.seconds = 1
> >         controller.socket.timeout.ms = 30000
> >         create.topic.policy.class.name = null
> >         default.replication.factor = 1
> >         delegation.token.expiry.check.interval.ms = 3600000
> >         delegation.token.expiry.time.ms = 86400000
> >         delegation.token.master.key = null
> >         delegation.token.max.lifetime.ms = 604800000
> >         delete.records.purgatory.purge.interval.requests = 1
> >         delete.topic.enable = true
> >         fetch.max.bytes = 57671680
> >         fetch.purgatory.purge.interval.requests = 1000
> >         group.initial.rebalance.delay.ms = 120000
> >         group.max.session.timeout.ms = 1200000
> >         group.max.size = 2147483647
> >         group.min.session.timeout.ms = 6000
> >         host.name =
> >         inter.broker.listener.name = null
> >         inter.broker.protocol.version = 2.7-IV2
> >         kafka.metrics.polling.interval.secs = 10
> >         kafka.metrics.reporters = []
> >         leader.imbalance.check.interval.seconds = 300
> >         leader.imbalance.per.broker.percentage = 10
> >         listener.security.protocol.map = PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
> >         listeners = PLAINTEXT://broker100:9092
> >         log.cleaner.backoff.ms = 15000
> >         log.cleaner.dedupe.buffer.size = 134217728
> >         log.cleaner.delete.retention.ms = 86400000
> >         log.cleaner.enable = true
> >         log.cleaner.io.buffer.load.factor = 0.9
> >         log.cleaner.io.buffer.size = 524288
> >         log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
> >         log.cleaner.max.compaction.lag.ms = 9223372036854775807
> >         log.cleaner.min.cleanable.ratio = 0.5
> >         log.cleaner.min.compaction.lag.ms = 0
> >         log.cleaner.threads = 1
> >         log.cleanup.policy = [delete]
> >         log.dir = /tmp/kafka-logs
> >         log.dirs = /mnt/store/latest/kafka/kafka-logs
> >         log.flush.interval.messages = 9223372036854775807
> >         log.flush.interval.ms = null
> >         log.flush.offset.checkpoint.interval.ms = 60000
> >         log.flush.scheduler.interval.ms = 9223372036854775807
> >         log.flush.start.offset.checkpoint.interval.ms = 60000
> >         log.index.interval.bytes = 4096
> >         log.index.size.max.bytes = 10485760
> >         log.message.downconversion.enable = true
> >         log.message.format.version = 2.7-IV2
> >         log.message.timestamp.difference.max.ms = 9223372036854775807
> >         log.message.timestamp.type = CreateTime
> >         log.preallocate = false
> >         log.retention.bytes = -1
> >         log.retention.check.interval.ms = 300000
> >         log.retention.hours = 1
> >         log.retention.minutes = null
> >         log.retention.ms = null
> >         log.roll.hours = 168
> >         log.roll.jitter.hours = 0
> >         log.roll.jitter.ms = null
> >         log.roll.ms = null
> >         log.segment.bytes = 1073741824
> >         log.segment.delete.delay.ms = 60000
> >         max.connection.creation.rate = 2147483647
> >         max.connections = 2147483647
> >         max.connections.per.ip = 2147483647
> >         max.connections.per.ip.overrides =
> >         max.incremental.fetch.session.cache.slots = 1000
> >         message.max.bytes = 31457280
> >         metric.reporters = []
> >         metrics.num.samples = 2
> >         metrics.recording.level = INFO
> >         metrics.sample.window.ms = 30000
> >         min.insync.replicas = 1
> >         num.io.threads = 4
> >         num.network.threads = 4
> >         num.partitions = 1
> >         num.recovery.threads.per.data.dir = 4
> >         num.replica.alter.log.dirs.threads = null
> >         num.replica.fetchers = 1
> >         offset.metadata.max.bytes = 4096
> >         offsets.commit.required.acks = -1
> >         offsets.commit.timeout.ms = 5000
> >         offsets.load.buffer.size = 5242880
> >         offsets.retention.check.interval.ms = 600000
> >         offsets.retention.minutes = 10080
> >         offsets.topic.compression.codec = 0
> >         offsets.topic.num.partitions = 50
> >         offsets.topic.replication.factor = 2
> >         offsets.topic.segment.bytes = 104857600
> >         password.encoder.cipher.algorithm = AES/CBC/PKCS5Padding
> >         password.encoder.iterations = 4096
> >         password.encoder.key.length = 128
> >         password.encoder.keyfactory.algorithm = null
> >         password.encoder.old.secret = null
> >         password.encoder.secret = null
> >         port = 9092
> >         principal.builder.class = null
> >         producer.purgatory.purge.interval.requests = 1000
> >         queued.max.request.bytes = -1
> >         queued.max.requests = 500
> >         quota.consumer.default = 9223372036854775807
> >         quota.producer.default = 9223372036854775807
> >         quota.window.num = 11
> >         quota.window.size.seconds = 1
> >         replica.fetch.backoff.ms = 1000
> >         replica.fetch.max.bytes = 31457280
> >         replica.fetch.min.bytes = 1
> >         replica.fetch.response.max.bytes = 10485760
> >         replica.fetch.wait.max.ms = 500
> >         replica.high.watermark.checkpoint.interval.ms = 5000
> >         replica.lag.time.max.ms = 30000
> >         replica.selector.class = null
> >         replica.socket.receive.buffer.bytes = 65536
> >         replica.socket.timeout.ms = 30000
> >         replication.quota.window.num = 11
> >         replication.quota.window.size.seconds = 1
> >         request.timeout.ms = 600000
> >         reserved.broker.max.id = 1000
> >         sasl.client.callback.handler.class = null
> >         sasl.enabled.mechanisms = [GSSAPI]
> >         sasl.jaas.config = null
> >         sasl.kerberos.kinit.cmd = /usr/bin/kinit
> >         sasl.kerberos.min.time.before.relogin = 60000
> >         sasl.kerberos.principal.to.local.rules = [DEFAULT]
> >         sasl.kerberos.service.name = null
> >         sasl.kerberos.ticket.renew.jitter = 0.05
> >         sasl.kerberos.ticket.renew.window.factor = 0.8
> >         sasl.login.callback.handler.class = null
> >         sasl.login.class = null
> >         sasl.login.refresh.buffer.seconds = 300
> >         sasl.login.refresh.min.period.seconds = 60
> >         sasl.login.refresh.window.factor = 0.8
> >         sasl.login.refresh.window.jitter = 0.05
> >         sasl.mechanism.inter.broker.protocol = GSSAPI
> >         sasl.server.callback.handler.class = null
> >         security.inter.broker.protocol = PLAINTEXT
> >         security.providers = null
> >         socket.connection.setup.timeout.max.ms = 127000
> >         socket.connection.setup.timeout.ms = 10000
> >         socket.receive.buffer.bytes = 102400
> >         socket.request.max.bytes = 104857600
> >         socket.send.buffer.bytes = 102400
> >         ssl.cipher.suites = []
> >         ssl.client.auth = none
> >         ssl.enabled.protocols = [TLSv1.2]
> >         ssl.endpoint.identification.algorithm = https
> >         ssl.engine.factory.class = null
> >         ssl.key.password = null
> >         ssl.keymanager.algorithm = SunX509
> >         ssl.keystore.certificate.chain = null
> >         ssl.keystore.key = null
> >         ssl.keystore.location = null
> >         ssl.keystore.password = null
> >         ssl.keystore.type = JKS
> >         ssl.principal.mapping.rules = DEFAULT
> >         ssl.protocol = TLSv1.2
> >         ssl.provider = null
> >         ssl.secure.random.implementation = null
> >         ssl.trustmanager.algorithm = PKIX
> >         ssl.truststore.certificates = null
> >         ssl.truststore.location = null
> >         ssl.truststore.password = null
> >         ssl.truststore.type = JKS
> >         transaction.abort.timed.out.transaction.cleanup.interval.ms = 10000
> >         transaction.max.timeout.ms = 900000
> >         transaction.remove.expired.transaction.cleanup.interval.ms = 3600000
> >         transaction.state.log.load.buffer.size = 5242880
> >         transaction.state.log.min.isr = 2
> >         transaction.state.log.num.partitions = 50
> >         transaction.state.log.replication.factor = 2
> >         transaction.state.log.segment.bytes = 104857600
> >         transactional.id.expiration.ms = 604800000
> >         unclean.leader.election.enable = false
> >         zookeeper.clientCnxnSocket = null
> >         zookeeper.connect = broker100:2181,broker101:2181,broker102:2181,broker103:2181,broker104:2181
> >         zookeeper.connection.timeout.ms = 30000
> >         zookeeper.max.in.flight.requests = 10
> >         zookeeper.session.timeout.ms = 18000
> >         zookeeper.set.acl = false
> >         zookeeper.ssl.cipher.suites = null
> >         zookeeper.ssl.client.enable = false
> >         zookeeper.ssl.crl.enable = false
> >         zookeeper.ssl.enabled.protocols = null
> >         zookeeper.ssl.endpoint.identification.algorithm = HTTPS
> >         zookeeper.ssl.keystore.location = null
> >         zookeeper.ssl.keystore.password = null
> >         zookeeper.ssl.keystore.type = null
> >         zookeeper.ssl.ocsp.enable = false
> >         zookeeper.ssl.protocol = TLSv1.2
> >         zookeeper.ssl.truststore.location = null
> >         zookeeper.ssl.truststore.password = null
> >         zookeeper.ssl.truststore.type = null
> >         zookeeper.sync.time.ms = 2000
> >
> > Thanks,
> > Tony
> >
> > On Wed, Jul 14, 2021 at 4:58 PM Shilin Wu <s...@confluent.io.invalid>
> > wrote:
> >
> >> Depending on your original version, you may have to consult the upgrade
> >> guide.
> >> https://kafka.apache.org/27/documentation.html#upgrade
> >>
> >> Didn't see important compatibility settings like:
> >> [image: image.png]
> >>
> >>
> >> Perhaps you are not doing it correctly.
> >>
> >>
> >> Wu Shilin
> >> Solution Architect
> >> +6581007012
> >>
> >>
> >> On Wed, Jul 14, 2021 at 7:21 PM Tony John <tonyjohnant...@gmail.com>
> >> wrote:
> >>
> >>> Can someone help me with this?
> >>>
> >>> Thanks,
> >>> Tony
> >>>
> >>> On Fri, Jul 9, 2021 at 8:15 PM Tony John <tonyjohnant...@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi All,
> >>> >
> >>> > I am trying to upgrade my Kafka Streams application to version 2.7.1 of
> >>> > Kafka. The brokers are upgraded to 2.7.1 and the Kafka dependencies are
> >>> > also on 2.7.1. But when I start the application, the rebalance fails with
> >>> > the following message:
> >>> >
> >>> > Rebalance failed. org.apache.kafka.common.errors.DisconnectException
> >>> >
> >>> > I am also seeing: Group coordinator broker102:9092 (id: 2147483645 rack:
> >>> > null) is unavailable or invalid due to cause: coordinator
> >>> > unavailable.isDisconnected: false. Rediscovery will be attempted.
> >>> >
> >>> > The full set of logs (which gets printed every 30 seconds) is given
> >>> > below.
> >>> >
> >>> > INFO  2021-07-09 09:33:20.229 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] Group coordinator broker102:9092 (id: 2147483645 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.
> >>> > INFO  2021-07-09 09:33:20.230 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] Discovered group coordinator broker102:9092 (id: 2147483645 rack: null)
> >>> > INFO  2021-07-09 09:33:20.230 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] Group coordinator broker102:9092 (id: 2147483645 rack: null) is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: false. Rediscovery will be attempted.
> >>> > INFO  2021-07-09 09:33:20.330 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] Discovered group coordinator broker102:9092 (id: 2147483645 rack: null)
> >>> > INFO  2021-07-09 09:33:20.331 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] Rebalance failed. org.apache.kafka.common.errors.DisconnectException
> >>> > INFO  2021-07-09 09:33:20.331 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] (Re-)joining group
> >>> > INFO  2021-07-09 09:33:20.333 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-task-coordinator-consumer-app-node100-9, groupId=my-app-v1-task-coordinator-consumer-app-node100] (Re-)joining group
> >>> > INFO  2021-07-09 09:33:20.419 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] Group coordinator broker101:9092 (id: 2147483646 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.
> >>> > INFO  2021-07-09 09:33:20.420 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] Discovered group coordinator broker101:9092 (id: 2147483646 rack: null)
> >>> > INFO  2021-07-09 09:33:20.420 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] Group coordinator broker101:9092 (id: 2147483646 rack: null) is unavailable or invalid due to cause: coordinator unavailable.isDisconnected: false. Rediscovery will be attempted.
> >>> > INFO  2021-07-09 09:33:20.521 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] Discovered group coordinator broker101:9092 (id: 2147483646 rack: null)
> >>> > INFO  2021-07-09 09:33:20.522 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] Rebalance failed. org.apache.kafka.common.errors.DisconnectException
> >>> > INFO  2021-07-09 09:33:20.523 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] (Re-)joining group
> >>> > INFO  2021-07-09 09:33:20.524 | internals.AbstractCoordinator [Consumer clientId=consumer-my-app-v1-master-coordinator-consumer-8, groupId=my-app-v1-master-coordinator-consumer] (Re-)joining group
> >>> >
> >>> > The application was working fine on 2.5.1. Please note that with 2.5.1 the
> >>> > build used was kafka_*2.12*-2.5.1, but with 2.7.1 I used
> >>> > kafka_*2.13*-2.7.1.
> >>> >
> >>> > *The broker config is *
> >>> >
> >>> > broker.id=2
> >>> > listeners=PLAINTEXT://broker102:9092
> >>> > advertised.listeners=PLAINTEXT://broker102:9092
> >>> > num.network.threads=4
> >>> > num.io.threads=4
> >>> > socket.send.buffer.bytes=102400
> >>> > socket.receive.buffer.bytes=102400
> >>> > socket.request.max.bytes=104857600
> >>> > log.dirs=/mnt/store/kafka/kafka-logs
> >>> > num.partitions=1
> >>> > num.recovery.threads.per.data.dir=4
> >>> > offsets.topic.replication.factor=2
> >>> > transaction.state.log.replication.factor=2
> >>> > transaction.state.log.min.isr=2
> >>> > log.retention.hours=1
> >>> > log.segment.bytes=1073741824
> >>> > log.retention.check.interval.ms=300000
> >>> >
> >>> > zookeeper.connect=broker100:2181,broker101:2181,broker102:2181,broker103:2181,broker104:2181
> >>> > zookeeper.connection.timeout.ms=30000
> >>> > group.initial.rebalance.delay.ms=120000
> >>> > offsets.retention.minutes=10080
> >>> > message.max.bytes=31457280
> >>> > replica.fetch.max.bytes=31457280
> >>> > group.max.session.timeout.ms=1200000
> >>> > request.timeout.ms=600000
> >>> > connections.max.idle.ms=1080000
> >>> >
> >>> > What could be wrong? Should I switch to kafka_*2.12*-2.7.1 ?
> >>> >
> >>> > Thanks,
> >>> > Tony
> >>> >
> >>> >
> >>>
> >>
>


-- 

*Daniel Meyer*, Full Stack Developer
Label Group: 7K! | !K7 Records | AUS | Strut | Soul Bank

K7 Music GmbH
Gerichtstr. 35
13347 Berlin, Germany

+49 30 4690505 | k7.com <http://www.k7.com/>

CEO / Geschaeftsfuehrer: Horst Weidenmueller, Registered at
Amtsgericht Charlottenburg HRB 60789 VAT-ID / USt-id-Nr: DE 812184824
