Hi Shimi,

Just to clarify, your scenario is that 1) you shut down the single-instance app cleanly, and 2) you restart the app as a single instance, and that restart is taking hours, right?

Did you use any in-memory stores (i.e., not RocksDB) in your topology?
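For context: the Streams DSL backs aggregation state with persistent RocksDB stores (plus changelog topics) by default, so state is only purely in memory if a store is explicitly declared as in-memory, e.g. through the Stores factory in the Processor API. A minimal sketch of the default, persistent case, using the sa-events topic from this thread (the store name here is just a placeholder):

    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    KStreamBuilder builder = new KStreamBuilder();
    // Uses the default key/value serdes from the streams configuration.
    KStream<String, String> events = builder.stream("sa-events");

    // count() materializes its state in a persistent RocksDB store backed by a
    // changelog topic, so a restarted instance can restore from local disk
    // instead of replaying the whole changelog.
    KTable<String, Long> counts = events.groupByKey().count("sa-events-counts");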
Guozhang

On Tue, May 16, 2017 at 1:08 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> Hi Shimi,
>
> Could we start a new email thread on the slow booting to separate it from the initial thread (call it "slow boot" or something)? Thank you. Also, could you provide the logs for the booting part if possible, together with your streams config.
>
> Thanks
> Eno
>
> On 15 May 2017, at 20:49, Shimi Kiviti <shim...@gmail.com> wrote:
> >
> > I do run the clients with 0.10.2.1 and it takes hours.
> > What I don't understand is why it takes hours to boot on a server that already has all the data in RocksDB. Is that related to the amount of data in RocksDB (the changelog topics) or to the data in the source topic the processes read from?
> >
> > On Mon, 15 May 2017 at 20:32 Guozhang Wang <wangg...@gmail.com> wrote:
> >
> >> Hello Shimi,
> >>
> >> Could you try upgrading your clients to 0.10.2.1 (note that you do not need to upgrade your servers if they are already on 0.10.1, since newer Streams clients can talk directly to older-versioned brokers since 0.10.1+) and try it out again? There are a few optimizations to reduce rebalance latencies in both the underlying consumer client and the Streams library, and hopefully they will help with your rebalance issues.
> >>
> >> Also, we have a bunch more fixes for consumer rebalance that we have already pushed to trunk, and they will be included in the upcoming June release of 0.11.0.0.
> >>
> >> Guozhang
> >>
> >> On Sat, May 13, 2017 at 12:32 PM, Shimi Kiviti <shim...@gmail.com> wrote:
> >>
> >>> I tried all these configurations and now, just like with version 0.10.1.1, I see a very slow startup.
> >>> I decreased the cluster to a single server, which was running without any problem for a few hours. Now, each time I restart this process it gets into a rebalancing state for several hours.
> >>> That means that every time we need to deploy a new version of our app (which can be several times a day) we have a downtime of hours.
> >>>
> >>> On Sat, May 6, 2017 at 5:13 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> >>>
> >>>> Yeah, we've seen cases when the session timeout might also need increasing. Could you try upping it to something like 60000ms and let us know how it goes:
> >>>>
> >>>>>> streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);
> >>>>
> >>>> Thanks
> >>>> Eno
> >>>>
> >>>>> On May 6, 2017, at 8:35 AM, Shimi Kiviti <shim...@gmail.com> wrote:
> >>>>>
> >>>>> Thanks Eno,
> >>>>> I already set the receive buffer size to 1MB.
> >>>>> I will also try the producer send buffer.
> >>>>>
> >>>>> What about the session timeout and heartbeat timeout? Do you think they should be increased?
> >>>>>
> >>>>> Thanks,
> >>>>> Shimi
> >>>>>
> >>>>> On Sat, 6 May 2017 at 0:21 Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Shimi,
> >>>>>>
> >>>>>> I've noticed with our benchmarks that on AWS environments with high network latency the network socket buffers often need adjusting. Any chance you could add the following to your streams configuration to change the default socket size bytes to a higher value (at least 1MB) and let us know?
> >>>>>>
> >>>>>> private static final int SOCKET_SIZE_BYTES = 1 * 1024 * 1024; // at least 1MB
> >>>>>> streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> >>>>>> streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
> >>>>>>
> >>>>>> Thanks
> >>>>>> Eno
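Putting the socket-buffer and session-timeout suggestions from this thread together, the overrides would look roughly like the sketch below (the application id matches the "sa" group id used here; the bootstrap servers are placeholders for your own brokers):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties streamsProps = new Properties();
    streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "sa");
    streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");

    // Larger socket buffers for high-latency links (e.g. cross-AZ traffic in AWS).
    final int SOCKET_SIZE_BYTES = 1024 * 1024; // at least 1MB
    streamsProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, SOCKET_SIZE_BYTES);
    streamsProps.put(ProducerConfig.SEND_BUFFER_CONFIG, SOCKET_SIZE_BYTES);

    // Longer session timeout so a slow startup does not get the member kicked
    // out of the group and trigger yet another rebalance.
    streamsProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);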
> >>>>>>
> >>>>>>> On May 4, 2017, at 3:45 PM, Shimi Kiviti <shim...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Eno,
> >>>>>>>
> >>>>>>> We still see problems on our side.
> >>>>>>> When we run kafka-streams 0.10.1.1 the problem eventually goes away, but with 0.10.2.1 it does not.
> >>>>>>> We see a lot of the rebalancing messages I wrote about before.
> >>>>>>>
> >>>>>>> On at least one kafka-streams node we see disconnection messages like the following. These messages repeat all the time:
> >>>>>>>
> >>>>>>> 2017-05-04 14:25:56,063 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator: Discovered coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) for group sa.
> >>>>>>> 2017-05-04 14:25:56,063 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Initiating connection to node 2147483646 at ip-10-0-91-10.ec2.internal:9092.
> >>>>>>> 2017-05-04 14:25:56,091 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator: (Re-)joining group sa
> >>>>>>> 2017-05-04 14:25:56,093 [StreamThread-1] DEBUG o.a.k.s.p.i.StreamPartitionAssignor: stream-thread [StreamThread-1] found [sa-events] topics possibly matching regex
> >>>>>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.s.p.TopologyBuilder: stream-thread [StreamThread-1] updating builder with SubscriptionUpdates{updatedTopicSubscriptions=[sa-events]} topic(s) with possible matching regex subscription(s)
> >>>>>>> 2017-05-04 14:25:56,096 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator: Sending JoinGroup ((type: JoinGroupRequest, groupId=sa, sessionTimeout=10000, rebalanceTimeout=2147483647, memberId=, protocolType=consumer, groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@2f894d9b)) to coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null)
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.n.Selector: Created socket with SO_RCVBUF = 1048576, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 2147483646
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Completed connection to node 2147483646. Fetching API versions.
> >>>>>>> 2017-05-04 14:25:56,097 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Initiating API versions fetch from node 2147483646.
> >>>>>>> 2017-05-04 14:25:56,104 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Recorded API versions for node 2147483646: (Produce(0): 0 to 2 [usable: 2], Fetch(1): 0 to 3 [usable: 3], Offsets(2): 0 to 1 [usable: 1], Metadata(3): 0 to 2 [usable: 2], LeaderAndIsr(4): 0 [usable: 0], StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 2 [usable: 2], ControlledShutdown(7): 1 [usable: 1], OffsetCommit(8): 0 to 2 [usable: 2], OffsetFetch(9): 0 to 1 [usable: 1], GroupCoordinator(10): 0 [usable: 0], JoinGroup(11): 0 to 1 [usable: 1], Heartbeat(12): 0 [usable: 0], LeaveGroup(13): 0 [usable: 0], SyncGroup(14): 0 [usable: 0], DescribeGroups(15): 0 [usable: 0], ListGroups(16): 0 [usable: 0], SaslHandshake(17): 0 [usable: 0], ApiVersions(18): 0 [usable: 0], CreateTopics(19): 0 [usable: 0], DeleteTopics(20): 0 [usable: 0])
> >>>>>>> 2017-05-04 14:29:44,800 [kafka-producer-network-thread | sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG o.a.k.c.NetworkClient: Node -2 disconnected.
> >>>>>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread | sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG o.a.k.c.NetworkClient: Sending metadata request (type=MetadataRequest, topics=) to node 1
> >>>>>>> 2017-05-04 14:29:44,801 [kafka-producer-network-thread | sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG o.a.k.c.NetworkClient: Node -1 disconnected.
> >>>>>>> 2017-05-04 14:29:44,802 [kafka-producer-network-thread | sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-producer] DEBUG o.a.k.c.Metadata: Updated cluster metadata version 4 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null), ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null)], partitions = [])
> >>>>>>> 2017-05-04 14:30:56,062 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> >>>>>>> 2017-05-04 14:30:56,073 [StreamThread-1] DEBUG o.a.k.c.Metadata: Updated cluster metadata version 7 to Cluster(id = JsVqjH3tS4CIcqpd2jkogA, nodes = [ip-10-0-95-250.ec2.internal:9092 (id: 2 rack: null), ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)], partitions = [Partition(topic = sa-events, partition = 0, leader = 1, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events, partition = 1, leader = 2, replicas = [1,2], isr = [2,1]), Partition(topic = sa-events, partition = 2, leader = 1, replicas = [1,2], isr = [2,1])])
> >>>>>>> 2017-05-04 14:31:06,085 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Disconnecting from node 2147483646 due to request timeout.
> >>>>>>> 2017-05-04 14:31:06,086 [StreamThread-1] DEBUG o.a.k.c.c.i.ConsumerNetworkClient: Cancelled JOIN_GROUP request {api_key=11,api_version=1,correlation_id=16,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer} with correlation id 16 due to node 2147483646 being disconnected
> >>>>>>> 2017-05-04 14:31:06,086 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator: Marking the coordinator ip-10-0-91-10.ec2.internal:9092 (id: 2147483646 rack: null) dead for group sa
> >>>>>>> 2017-05-04 14:31:06,195 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator: Sending GroupCoordinator request for group sa to broker ip-10-0-91-10.ec2.internal:9092 (id: 1 rack: null)
> >>>>>>> 2017-05-04 14:31:06,196 [StreamThread-1] DEBUG o.a.k.c.NetworkClient: Sending metadata request (type=MetadataRequest, topics=<ALL>) to node 2
> >>>>>>> 2017-05-04 14:31:06,200 [StreamThread-1] DEBUG o.a.k.c.c.i.AbstractCoordinator: Received GroupCoordinator response ClientResponse(receivedTimeMs=1493908266200, latencyMs=5, disconnected=false, requestHeader={api_key=10,api_version=0,correlation_id=19,client_id=sa-5788b5a5-aadc-4276-916f-1640008c17da-StreamThread-1-consumer}, responseBody={error_code=0,coordinator={node_id=1,host=ip-10-0-91-10.ec2.internal,port=9092}}) for group sa
> >>>>>>>
> >>>>>>> On Mon, May 1, 2017 at 4:19 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Shimi,
> >>>>>>>>
> >>>>>>>> 0.10.2.1 contains a number of fixes that should make the out-of-the-box experience better, including resiliency under broker failures and better exception handling. If you ever get back to it, and if the problem happens again, please do send us the logs and we'll happily have a look.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Eno
> >>>>>>>>
> >>>>>>>>> On 1 May 2017, at 12:05, Shimi Kiviti <shim...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Eno,
> >>>>>>>>> I am afraid I played too much with the configuration to make this a productive investigation :(
> >>>>>>>>>
> >>>>>>>>> This is a QA environment which includes 2 Kafka instances and 3 ZooKeeper instances in AWS. There are only 3 partitions for this topic.
> >>>>>>>>> The Kafka broker and kafka-streams are version 0.10.1.1.
> >>>>>>>>> Our kafka-streams app runs on Docker using Kubernetes.
> >>>>>>>>> I played around with 1 to 3 kafka-streams processes, but I got the same results. It is too easy to scale with Kubernetes :)
> >>>>>>>>> Since there are only 3 partitions, I didn't start more than 3 instances.
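As a side note on sizing: with a single input topic, Streams creates one task per partition, so 3 partitions means at most 3 active tasks, and any threads or instances beyond that sit idle. A minimal sketch of the per-instance thread count (reusing the streamsProps from the sketch above; the value is only an example):

    import org.apache.kafka.streams.StreamsConfig;

    // 1 instance with 3 threads, or 3 instances with 1 thread each, both cover
    // the 3 partitions of sa-events; anything beyond that would be idle.
    streamsProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3);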
> >>>>>>>>> I was too quick to upgrade only the kafka-streams app to 0.10.2.1 in the hope that it would solve the problem. It didn't.
> >>>>>>>>> The logs I sent before are from this version.
> >>>>>>>>>
> >>>>>>>>> I did notice an "unknown" offset for the main topic with kafka-streams version 0.10.2.1:
> >>>>>>>>>
> >>>>>>>>> $ ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group sa
> >>>>>>>>> GROUP  TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      OWNER
> >>>>>>>>> sa     sa-events  0          842199          842199          0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>> sa     sa-events  1          1078428         1078428         0        sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>> sa     sa-events  2          unknown         26093910        unknown  sa-4557bf2d-ba79-42a6-aa05-5b4c9013c022-StreamThread-1-consumer_/10.0.10.9
> >>>>>>>>>
> >>>>>>>>> After that I downgraded the kafka-streams app back to version 0.10.1.1.
> >>>>>>>>> After a LONG startup time (more than an hour) where the status of the group was rebalancing, all 3 processes started processing messages again.
> >>>>>>>>>
> >>>>>>>>> This whole thing started after we hit a bug in our code (an NPE) that crashed the stream processing thread.
> >>>>>>>>> So now, after 4 days, everything is back to normal.
> >>>>>>>>> This worries me since it can happen again.
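On the NPE that killed the stream thread: in these versions a crashed StreamThread is not replaced automatically, so it may be worth registering an uncaught exception handler so the failure is at least logged and can trigger an alert or a restart. A minimal sketch, assuming the builder and streamsProps names used in the sketches above:

    import org.apache.kafka.streams.KafkaStreams;

    KafkaStreams streams = new KafkaStreams(builder, streamsProps);
    // Invoked when a stream thread dies with an unhandled exception (e.g. an NPE
    // in user code); log it and decide whether to restart the application.
    streams.setUncaughtExceptionHandler((thread, throwable) ->
        System.err.println("Stream thread " + thread.getName() + " died: " + throwable));
    streams.start();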
> >>>>>>>>>
> >>>>>>>>> On Mon, May 1, 2017 at 11:45 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Shimi,
> >>>>>>>>>>
> >>>>>>>>>> Could you provide more info on your setup? How many Kafka Streams processes do you have, and how many partitions are they consuming from? If you have more processes than partitions, some of the processes will be idle and won't do anything.
> >>>>>>>>>>
> >>>>>>>>>> Eno
> >>>>>>>>>>
> >>>>>>>>>>> On Apr 30, 2017, at 5:58 PM, Shimi Kiviti <shim...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I have a problem and I hope one of you can help me figure it out.
> >>>>>>>>>>> One of our kafka-streams processes stopped processing messages.
> >>>>>>>>>>>
> >>>>>>>>>>> When I turn on debug logging I see lots of these messages:
> >>>>>>>>>>>
> >>>>>>>>>>> 2017-04-30 15:42:20,228 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher: Sending fetch for partitions [devlast-changelog-2] to broker ip-x-x-x-x.ec2.internal:9092 (id: 1 rack: null)
> >>>>>>>>>>> 2017-04-30 15:42:20,696 [StreamThread-1] DEBUG o.a.k.c.c.i.Fetcher: Ignoring fetched records for devlast-changelog-2 at offset 2962649 since the current position is 2963379
> >>>>>>>>>>>
> >>>>>>>>>>> After a LONG time, the only messages in the log are these:
> >>>>>>>>>>>
> >>>>>>>>>>> 2017-04-30 16:46:33,324 [kafka-coordinator-heartbeat-thread | sa] DEBUG o.a.k.c.c.i.AbstractCoordinator: Sending Heartbeat request for group sa to coordinator ip-x-x-x-x.ec2.internal:9092 (id: 2147483646 rack: null)
> >>>>>>>>>>> 2017-04-30 16:46:33,425 [kafka-coordinator-heartbeat-thread | sa] DEBUG o.a.k.c.c.i.AbstractCoordinator: Received successful Heartbeat response for group sa
> >>>>>>>>>>>
> >>>>>>>>>>> Any idea?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Shimi
> >>
> >> --
> >> -- Guozhang
> >

--
-- Guozhang