Hi Kostas,

Copy-pasting below the snippet where we see the fluctuations. Let me know if this helps.
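For context, the DEBUG output below was captured by bumping the Kafka client loggers in the TaskManager's log4j.properties, roughly along these lines (illustrative only -- the exact loggers and appenders in our deployment may differ slightly):

    # illustrative sketch: surface the Kafka consumer internals seen in the snippet below
    log4j.logger.org.apache.kafka=DEBUG
    log4j.logger.org.apache.kafka.clients.NetworkClient=DEBUG
    log4j.logger.org.apache.kafka.clients.consumer.internals.Fetcher=DEBUG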
2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Node 3 disconnected.
2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null) for sending metadata request
2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null)
2020-09-22 23:39:19,664 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984310 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984311, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1516)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984311 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Sending metadata request (type=MetadataRequest, topics=captchastream) to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.common.network.Selector - Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 4
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 4. Fetching API versions.
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Initiating API versions fetch from node 4.
2020-09-22 23:39:19,666 DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 319 to Cluster(id = 4ou4oBz8TU24ipwW8ws1Bw, nodes = [be-kafka-dragonpit-broker-6:8017 (id: 6 rack: null), be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null), be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null), be-kafka-dragonpit-broker-3:8017 (id: 3 rack: null), be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null), be-kafka-dragonpit-broker-7:8017 (id: 7 rack: null)], partitions = [Partition(topic = captchastream, partition = 8, leader = 7, replicas = [7,3], isr = [7,3]), Partition(topic = captchastream, partition = 9, leader = 8, replicas = [8,4], isr = [4,8]), Partition(topic = captchastream, partition = 4, leader = 3, replicas = [3,4], isr = [3,4]), Partition(topic = captchastream, partition = 5, leader = 4, replicas = [4,5], isr = [4,5]), Partition(topic = captchastream, partition = 6, leader = 5, replicas = [5,7], isr = [5,7]), Partition(topic = captchastream, partition = 7, leader = 6, replicas = [6,8], isr = [8,6]), Partition(topic = captchastream, partition = 0, leader = 5, replicas = [5,6], isr = [5,6]), Partition(topic = captchastream, partition = 1, leader = 6, replicas = [6,7], isr = [7,6]), Partition(topic = captchastream, partition = 2, leader = 7, replicas = [7,8], isr = [7,8]), Partition(topic = captchastream, partition = 3, leader = 8, replicas = [8,3], isr = [8,3])])
2020-09-22 23:39:19,666 DEBUG org.apache.kafka.clients.NetworkClient - Recorded API versions for node 4: (Produce(0): 0 to 5 [usable: 3], Fetch(1): 0 to 7 [usable: 5], Offsets(2): 0 to 2 [usable: 2], Metadata(3): 0 to 5 [usable: 4], LeaderAndIsr(4): 0 to 1 [usable: 0], StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 4 [usable: 3], ControlledShutdown(7): 0 to 1 [usable: 1], OffsetCommit(8): 0 to 3 [usable: 3], OffsetFetch(9): 0 to 3 [usable: 3], FindCoordinator(10): 0 to 1 [usable: 1], JoinGroup(11): 0 to 2 [usable: 2], Heartbeat(12): 0 to 1 [usable: 1], LeaveGroup(13): 0 to 1 [usable: 1], SyncGroup(14): 0 to 1 [usable: 1], DescribeGroups(15): 0 to 1 [usable: 1], ListGroups(16): 0 to 1 [usable: 1], SaslHandshake(17): 0 to 1 [usable: 0], ApiVersions(18): 0 to 1 [usable: 1], CreateTopics(19): 0 to 2 [usable: 2], DeleteTopics(20): 0 to 1 [usable: 1], DeleteRecords(21): 0 [usable: 0], InitProducerId(22): 0 [usable: 0], OffsetForLeaderEpoch(23): 0 [usable: 0], AddPartitionsToTxn(24): 0 [usable: 0], AddOffsetsToTxn(25): 0 [usable: 0], EndTxn(26): 0 [usable: 0], WriteTxnMarkers(27): 0 [usable: 0], TxnOffsetCommit(28): 0 [usable: 0], DescribeAcls(29): 0 [usable: 0], CreateAcls(30): 0 [usable: 0], DeleteAcls(31): 0 [usable: 0], DescribeConfigs(32): 0 to 1 [usable: 0], AlterConfigs(33): 0 [usable: 0], UNKNOWN(34): 0, UNKNOWN(35): 0, UNKNOWN(36): 0, UNKNOWN(37): 0, UNKNOWN(38): 0, UNKNOWN(39): 0, UNKNOWN(40): 0, UNKNOWN(41): 0, UNKNOWN(42): 0)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984311 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984312, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=3479)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984312 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984312 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984313, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1523)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984313 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:20,239 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984313 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984314, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1296)
2020-09-22 23:39:20,239 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984314 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Node 4 disconnected.
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989827 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989828, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1019)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989828 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null) for sending metadata request
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null)
2020-09-22 23:48:19,683 DEBUG org.apache.kafka.common.network.Selector - Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 8
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 8. Fetching API versions.
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Initiating API versions fetch from node 8.
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Sending metadata request (type=MetadataRequest, topics=captchastream) to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,685 DEBUG org.apache.kafka.clients.NetworkClient - Recorded API versions for node 8: (Produce(0): 0 to 5 [usable: 3], Fetch(1): 0 to 7 [usable: 5], Offsets(2): 0 to 2 [usable: 2], Metadata(3): 0 to 5 [usable: 4], LeaderAndIsr(4): 0 to 1 [usable: 0], StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 4 [usable: 3], ControlledShutdown(7): 0 to 1 [usable: 1], OffsetCommit(8): 0 to 3 [usable: 3], OffsetFetch(9): 0 to 3 [usable: 3], FindCoordinator(10): 0 to 1 [usable: 1], JoinGroup(11): 0 to 2 [usable: 2], Heartbeat(12): 0 to 1 [usable: 1], LeaveGroup(13): 0 to 1 [usable: 1], SyncGroup(14): 0 to 1 [usable: 1], DescribeGroups(15): 0 to 1 [usable: 1], ListGroups(16): 0 to 1 [usable: 1], SaslHandshake(17): 0 to 1 [usable: 0], ApiVersions(18): 0 to 1 [usable: 1], CreateTopics(19): 0 to 2 [usable: 2], DeleteTopics(20): 0 to 1 [usable: 1], DeleteRecords(21): 0 [usable: 0], InitProducerId(22): 0 [usable: 0], OffsetForLeaderEpoch(23): 0 [usable: 0], AddPartitionsToTxn(24): 0 [usable: 0], AddOffsetsToTxn(25): 0 [usable: 0], EndTxn(26): 0 [usable: 0], WriteTxnMarkers(27): 0 [usable: 0], TxnOffsetCommit(28): 0 [usable: 0], DescribeAcls(29): 0 [usable: 0], CreateAcls(30): 0 [usable: 0], DeleteAcls(31): 0 [usable: 0], DescribeConfigs(32): 0 to 1 [usable: 0], AlterConfigs(33): 0 [usable: 0], UNKNOWN(34): 0, UNKNOWN(35): 0, UNKNOWN(36): 0, UNKNOWN(37): 0, UNKNOWN(38): 0, UNKNOWN(39): 0, UNKNOWN(40): 0, UNKNOWN(41): 0, UNKNOWN(42): 0)
2020-09-22 23:48:19,685 DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 321 to Cluster(id = 4ou4oBz8TU24ipwW8ws1Bw, nodes = [be-kafka-dragonpit-broker-6:8017 (id: 6 rack: null), be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null), be-kafka-dragonpit-broker-3:8017 (id: 3 rack: null), be-kafka-dragonpit-broker-7:8017 (id: 7 rack: null), be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null), be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null)], partitions = [Partition(topic = captchastream, partition = 8, leader = 7, replicas = [7,3], isr = [7,3]), Partition(topic = captchastream, partition = 9, leader = 8, replicas = [8,4], isr = [4,8]), Partition(topic = captchastream, partition = 4, leader = 3, replicas = [3,4], isr = [3,4]), Partition(topic = captchastream, partition = 5, leader = 4, replicas = [4,5], isr = [4,5]), Partition(topic = captchastream, partition = 6, leader = 5, replicas = [5,7], isr = [5,7]), Partition(topic = captchastream, partition = 7, leader = 6, replicas = [6,8], isr = [8,6]), Partition(topic = captchastream, partition = 0, leader = 5, replicas = [5,6], isr = [5,6]), Partition(topic = captchastream, partition = 1, leader = 6, replicas = [6,7], isr = [7,6]), Partition(topic = captchastream, partition = 2, leader = 7, replicas = [7,8], isr = [7,8]), Partition(topic = captchastream, partition = 3, leader = 8, replicas = [8,3], isr = [8,3])])
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989828 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989829, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=3489)
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989829 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,902 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989829 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989830, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1736)
2020-09-22 23:48:19,903 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989830 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)

On Wed, Sep 23, 2020 at 3:08 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> Hi Ramya,
>
> Unfortunately I cannot see them.
>
> Kostas
>
> On Wed, Sep 23, 2020 at 10:27 AM Ramya Ramamurthy <hair...@gmail.com> wrote:
> >
> > Hi Kostas,
> >
> > Attaching the TaskManager logs regarding this issue.
> > I have attached the Kafka-related metrics. I hope you can see them this time.
> >
> > Not sure why we get so many disconnects from Kafka. Maybe because of these
> > interruptions, we seem to slow down on our processing. At some point the
> > memory also increases and the workers almost stagnate, not doing any
> > processing. I have 3GB heap committed and have allotted 5GB of memory to the pods.
> >
> > Thanks for your help.
> >
> > ~Ramya.
> >
> > On Tue, Sep 22, 2020 at 9:18 PM Kostas Kloudas <kklou...@gmail.com> wrote:
> >>
> >> Hi Ramya,
> >>
> >> Unfortunately your images are blocked. Could you upload them somewhere and
> >> post the links here?
> >> Also I think that the TaskManager logs may be able to help a bit more.
> >> Could you please provide them here?
> >>
> >> Cheers,
> >> Kostas
> >>
> >> On Tue, Sep 22, 2020 at 8:58 AM Ramya Ramamurthy <hair...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > We are seeing an issue with Flink in our production environment. We are
> >> > using version 1.7.
> >> > We started seeing sudden lag on Kafka, and the consumers were no longer
> >> > working/accepting messages. On trying to enable debug mode, the below
> >> > errors were seen:
> >> > [image: image.jpeg]
> >> >
> >> > I am not sure why this occurs every day; when it happens, I can see that
> >> > the remaining workers aren't able to handle the load. Unless I restart my
> >> > jobs, I am unable to start processing again. This way, there is data loss
> >> > as well.
> >> >
> >> > On the graph below, there is a slight dip in consumption before 5:30.
> >> > That is when this incident happens, as correlated with the logs.
> >> >
> >> > [image: image.jpeg]
> >> >
> >> > Any pointers/suggestions would be appreciated.
> >> >
> >> > Thanks.