Hi Nitay,

I'm sorry to hear of these troubles; it sounds frustrating.

No worries about spamming the list, but it does sound like this might be
worth tracking as a bug report in Jira.
Obviously, we do not expect to lose data when instances come and go,
regardless of the frequency, and we do have tests in place to verify this.
Of course, you might be exercising something that our tests miss.

Thanks for collating the logs. It really helps to understand what's going
on.

Unfortunately, the red coloring didn't make it through the mailing list, so
I'm not sure which specific line you were referencing as demonstrating data
loss.

In case you're concerned about the "Updating StandbyTasks failed" warnings:
those should be benign. That warning means a thread was unable to re-use a
state store it had previously been assigned, so instead it deletes the local
data and recreates the whole store from the changelog.
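
If you want more visibility into when that rebuild happens (and how long it
takes), one option is to register a global state restore listener on the
KafkaStreams instance before starting it. Here's a minimal sketch, assuming
you just want the events printed (swap in your own logger). I'd have to
double-check whether the standby rebuild path reports through this listener
in 2.1, but it will at least show you active-task restores:

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Reports every state store restoration from the changelog. Register it via
// streams.setGlobalStateRestoreListener(new LoggingRestoreListener()) before
// calling streams.start().
public class LoggingRestoreListener implements StateRestoreListener {

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("Restore of %s (%s) starting: offsets %d to %d%n",
                          storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        // called after each restored batch; handy for tracking progress
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName,
                             long totalRestored) {
        System.out.printf("Restore of %s (%s) finished: %d records restored%n",
                          storeName, partition, totalRestored);
    }
}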

The Streams logs that would be really useful to capture are the lifecycle
ones, like

[2018-12-14 17:34:30,326] INFO stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> State transition from RUNNING to PARTITIONS_REVOKED
> (org.apache.kafka.streams.processor.internals.StreamThread)



[2018-12-14 17:34:30,326] INFO stream-client
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980] State
> transition from RUNNING to REBALANCING
> (org.apache.kafka.streams.KafkaStreams)


Also, it would be helpful to see the assignment transitions in line with
the state transitions. Examples:

[2018-12-14 17:34:31,863] DEBUG stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> Adding assigned tasks as active: {0_2=[standbyTaskSource1-2],
> 0_5=[standbyTaskSource1-5]}
> (org.apache.kafka.streams.processor.internals.TaskManager)



[2018-12-14 17:34:31,882] DEBUG stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> Adding assigned standby tasks {0_4=[standbyTaskSource1-4],
> 0_1=[standbyTaskSource1-1]}
> (org.apache.kafka.streams.processor.internals.TaskManager)



[2018-12-14 17:34:31,885] INFO stream-thread
> [kafka-streams-standby-tasks-75ca0cca-cc0b-4524-843c-2d9d1d555980-StreamThread-1]
> partition assignment took 22 ms.
>    current active tasks: [0_2, 0_5]
>    current standby tasks: [0_1, 0_4]
>    previous active tasks: []
>  (org.apache.kafka.streams.processor.internals.StreamThread)
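
(Note that the TaskManager lines above are logged at DEBUG, so you'd need
DEBUG enabled for org.apache.kafka.streams.processor.internals in your
logging config to capture them.)

If collecting those logs is awkward in your setup, you can also record the
client-level transitions yourself with a state listener. A minimal sketch,
assuming you register it on your KafkaStreams instance before calling
start():

import org.apache.kafka.streams.KafkaStreams;

public final class LifecycleLogging {

    // Call this on your KafkaStreams instance before streams.start(); it
    // prints every client-level transition (e.g. RUNNING -> REBALANCING).
    public static void register(KafkaStreams streams) {
        streams.setStateListener((newState, oldState) ->
                System.out.printf("KafkaStreams state transition: %s -> %s%n",
                                  oldState, newState));
    }
}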



I look forward to hearing back from you (either with more detailed logs or
just a clarification about how the given logs indicate data loss). A report
about potential data loss is very concerning.

Thanks,
-John

On Sun, Dec 30, 2018 at 9:23 AM Nitay Kufert <nita...@ironsrc.com> wrote:

> Hey everybody,
> We have been running Kafka Streams in production for the last year or so -
> we are currently using the latest version (2.1.0) and we have suffered from
> data loss several times before.
> The first time we noticed a data loss, we were able to trace it back to an
> Exception that we were getting in the code - which eventually mapped to an
> open bug that the Kafka team is still working on. So the temporary solution
> was to disable the feature that causes the Exception (in this case - it was
> the "exactly_once" semantics) and move to "at_least_once" semantics + a
> piece of code that handles duplications.
> The 2nd time we noticed a data loss, we traced it back to some kind of
> Exception caused by lack of memory. To make a long story short - we hit the
> limit for open files on the machines (a lot of files are used by RocksDB) -
> so increasing the RAM of the machines & increasing the number of allowed
> open files on the OS solved this problem.
>
> Now, we are facing data loss for the 3rd time - this time it seems to
> happen when our Kafka Streams instances switch (reproducible - it happened
> 2 separate times). Let me explain:
> We are using a 3rd-party company called Spotinst - which basically helps
> you save costs by monitoring the Amazon spot market and switching between
> instances when it finds a cheaper one.
>
> The question is, why would it cause data loss?
> These are the logs I collected and put together in a single timeline,
> including messages from the Kafka Streams instances (from Kibana), Spotinst
> (the 3rd-party company) & the data in the compacted topic where the data
> should have been kept (basically it's a compacted topic behind a reduce
> function - and it seems like the aggregated data was lost and the function
> was invoked as if it's the first time it's aggregating anything).
> What you are seeing is that Spotinst saw an increase in CPU and initiated
> an Upscale (2 instances), and shortly after it 2 instances went down
> (Downscale) as the load was over. In *RED* you can see the actual data loss
> (as observed from the compacted topic).
>
> DATE TIME FACILITY INFO
> 12/25/2018 5:17:03 Spotinst Instances Launched - Autoscaling: Policy Name:
> Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
> 12/25/2018 5:22:34 Spotinst Got Signal INSTANCE_READY For Instance
> I-instace1
> 12/25/2018 5:22:54 instace2 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:22:56 instace3 Updating StandbyTasks failed. Deleting
> StandbyTasks stores to recreate from scratch.:
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> Offsets out of range with no configured reset policy for partitions:
> {bos-unique_input_message-changelog-1=7491727}
> 12/25/2018 5:23:21 Spotinst instace4 - INSTANCE_READY
> 12/25/2018 5:23:27 Compacted Topic cdr_44334 -> 1621.72
> 12/25/2018 5:23:33 Spotinst instace4 - Successfully Registered To Load
> Balancer
> 12/25/2018 5:23:33 Spotinst instace1 - Successfully Registered To Load
> Balancer
> 12/25/2018 5:23:34 Compacted Topic cdr_44334-> 0.27
> 12/25/2018 5:23:40 instace2 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:23:46 instace3
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
> Offsets out of range with no configured reset policy for partitions:
> {bos-unique_input_message-changelog-1=7491727}
> 12/25/2018 5:27:05 Spotinst Instances Terminated - Autoscaling: Policy
> Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
> 12/25/2018 5:27:05 Spotinst instace2 - Was Successfully Deregistered From
> CLASSIC Load Balancer
> 12/25/2018 5:27:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> Instance I-instace2
> 12/25/2018 5:27:23 Spotinst Shutdown Script Completed. InstanceId:
> I-instace2 (View Details)
> 12/25/2018 5:27:27 instace4 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> 12/25/2018 5:33:03 Spotinst Instances I-instace3 Was Successfully
> Deregistered From CLASSIC Load Balancer
> 12/25/2018 5:33:03 Spotinst Instances Terminated - Autoscaling: Policy
> Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
> 12/25/2018 5:33:23 Spotinst Shutdown Script Completed. InstanceId:
> I-instace3 (View Details)
> 12/25/2018 5:33:23 Spotinst Got Signal INSTANCE_READY_TO_SHUTDOWN For
> Instance I-instace3
> 12/25/2018 5:33:28 instace4 The following subscribed topics are not
> assigned to any members: [bosThresholds]
> I can provide much more information but I feel like I already spammed the
> group as it is :/
> Hope you can make sense of what I am writing and maybe shed some light on
> the possible reasons for this strange behavior.
>
> For now, as a temporary solution, we are moving to "on-demand" instances
> (which basically means that machines won't go up and down often), so I hope
> it will solve our problems.
>
> Thanks
> --
> Nitay Kufert
> Backend Team Leader
