Kafka Streams 2.1.0, 3rd time data loss investigation

Nitay Kufert Sun, 30 Dec 2018 07:23:38 -0800

Hey everybody,
We have been running Kafka Streams in production for the last year or so. We
are currently on the latest version (2.1.0), and we have suffered data loss
several times before.
The first time we noticed data loss, we were able to trace it back to an
exception we were getting in the code, which eventually mapped to an open
bug that the Kafka team is still working on. The temporary solution was to
disable the feature that caused the exception (in this case, the
"exactly_once" semantics) and move to "at_least_once" semantics plus a piece
of code that handles duplicates.
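
As a rough illustration of the kind of duplicate handling mentioned above
(a sketch only, not the production code: the class name, the idea that every
message carries a unique ID, and the bounded cache size are all assumptions
made for this example), recently seen IDs can be remembered and anything
already seen can be dropped:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Kafka Streams: remembers the most
// recently seen message IDs and reports whether an ID is new.
class RecentIdDeduplicator {
    private final Map<String, Boolean> seen;

    RecentIdDeduplicator(final int capacity) {
        // Access-ordered LinkedHashMap that evicts the least recently
        // used ID once more than `capacity` IDs are held.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time an ID is offered, false for a duplicate
    // that is still inside the remembered window.
    boolean firstTime(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        RecentIdDeduplicator dedup = new RecentIdDeduplicator(1000);
        for (String id : new String[] {"m1", "m2", "m1"}) {
            System.out.println(id + (dedup.firstTime(id) ? " processed" : " skipped"));
        }
    }
}
```

In a Streams application the set of seen IDs would normally live in a
fault-tolerant state store rather than in process memory, so the same caveat
about losing local state applies to it.
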
The 2nd time we noticed data loss, we traced it back to some kind of
exception caused by a lack of memory. To make a long story short, we had hit
the limit for open files on the machines (RocksDB uses a lot of files), so
increasing the RAM of the machines and raising the number of allowed open
files in the OS solved the problem.
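
For anyone who wants to watch for the open-files limit from inside the JVM,
here is a minimal sketch. It assumes a Unix-like host and a HotSpot/OpenJDK
runtime that exposes com.sun.management.UnixOperatingSystemMXBean; the class
and method names below are made up for the example, and it is not something
we ran at the time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical monitoring helper: reports how close the JVM process is
// to its open-file-descriptor limit.
class FdUsage {
    // Returns open/max file descriptors as a fraction, or -1.0 when the
    // platform does not expose the Unix-specific MXBean.
    static double usedFraction() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return (double) unix.getOpenFileDescriptorCount()
                / (double) unix.getMaxFileDescriptorCount();
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("fraction of file descriptors in use: " + usedFraction());
    }
}
```

Logging or alerting on this value would show the limit approaching before
exceptions start to appear.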


Now we are facing data loss for the 3rd time. This time it seems to happen
when our Kafka Streams instances switch (it is reproducible: it happened on
2 separate occasions). Let me explain:
We are using a 3rd-party company called Spotinst, which basically helps
you save costs by monitoring the Amazon spot market and switching between
instances when it finds a cheaper one.

The question is, why would it cause data loss?
Below are logs I collected and put together in a single timeline, including
messages from the Kafka Streams instances (from Kibana), from Spotinst (the
3rd-party company), and the data in the compacted topic where the data
should have been kept (basically it's a compacted topic behind a reduce
function, and it seems like the aggregated data was lost and the function
was invoked as if it were the first time it was aggregating anything).
What you are seeing is that Spotinst saw an increase in CPU and initiated
an upscale (2 instances), and shortly afterwards 2 instances went down
(downscale) as the load was over. In *RED* you can see the actual data loss
(as observed from the compacted topic).
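
To make the symptom concrete, here is a toy model in plain Java with no
Kafka dependency. The class name, the use of a sum as the reducer, and the
way the two values are paired are assumptions for illustration only. It
relies on the documented behaviour that a reduce with no existing aggregate
for a key takes the incoming value as the new aggregate, so if the stored
aggregate disappears, the next output is just the latest input.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a reduce backed by a key-value store. It only mimics the
// semantics; it is not how Kafka Streams is implemented.
class ReduceStateLossDemo {
    private final Map<String, Double> store = new HashMap<>();

    // Mimics reduce((agg, v) -> agg + v): with no stored aggregate the
    // incoming value becomes the aggregate unchanged.
    double reduce(String key, double value) {
        return store.merge(key, value, Double::sum);
    }

    // Models the local state disappearing without a successful restore.
    void loseState() {
        store.clear();
    }

    public static void main(String[] args) {
        ReduceStateLossDemo demo = new ReduceStateLossDemo();
        System.out.println(demo.reduce("cdr_44334", 1621.72)); // aggregate so far
        demo.loseState();                                      // state is gone
        System.out.println(demo.reduce("cdr_44334", 0.27));    // starts over
    }
}
```

With the state intact, the second call would have returned roughly 1621.99
under these assumptions. The timeline's jump from 1621.72 to 0.27 for
cdr_44334 looks consistent with the "starts over" case.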

DATE        TIME     FACILITY         INFO
12/25/2018  5:17:03  Spotinst         Instances Launched - Autoscaling: Policy Name: Scaling Policy-Up, Threshold: 70.0, Value Observed: 70.0
12/25/2018  5:22:34  Spotinst         Got Signal INSTANCE_READY For Instance I-instace1
12/25/2018  5:22:54  instace2         The following subscribed topics are not assigned to any members: [bosThresholds]
12/25/2018  5:22:56  instace3         Updating StandbyTasks failed. Deleting StandbyTasks stores to recreate from scratch.: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {bos-unique_input_message-changelog-1=7491727}
12/25/2018  5:23:21  Spotinst         instace4 - INSTANCE_READY
12/25/2018  5:23:27  Compacted Topic  cdr_44334 -> 1621.72
12/25/2018  5:23:33  Spotinst         instace4 - Successfully Registered To Load Balancer
12/25/2018  5:23:33  Spotinst         instace1 - Successfully Registered To Load Balancer
12/25/2018  5:23:34  Compacted Topic  cdr_44334 -> 0.27
12/25/2018  5:23:40  instace2         The following subscribed topics are not assigned to any members: [bosThresholds]
12/25/2018  5:23:46  instace3         org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {bos-unique_input_message-changelog-1=7491727}
12/25/2018  5:27:05  Spotinst         Instances Terminated - Autoscaling: Policy Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 6.0
12/25/2018  5:27:05  Spotinst         instace2 - Was Successfully Deregistered From CLASSIC Load Balancer
12/25/2018  5:27:23  Spotinst         Got Signal INSTANCE_READY_TO_SHUTDOWN For Instance I-instace2
12/25/2018  5:27:23  Spotinst         Shutdown Script Completed. InstanceId: I-instace2 (View Details)
12/25/2018  5:27:27  instace4         The following subscribed topics are not assigned to any members: [bosThresholds]
12/25/2018  5:33:03  Spotinst         Instances I-instace3 Was Successfully Deregistered From CLASSIC Load Balancer
12/25/2018  5:33:03  Spotinst         Instances Terminated - Autoscaling: Policy Name: Scaling Policy-Down, Threshold: 40.0, Value Observed: 25.0
12/25/2018  5:33:23  Spotinst         Shutdown Script Completed. InstanceId: I-instace3 (View Details)
12/25/2018  5:33:23  Spotinst         Got Signal INSTANCE_READY_TO_SHUTDOWN For Instance I-instace3
12/25/2018  5:33:28  instace4         The following subscribed topics are not assigned to any members: [bosThresholds]

I can provide much more information but I feel like I already spammed the
group as it is :/
Hope you can make sense of what I am writing and maybe shed some light on
the possible reasons for this strange behavior.

For now, as a temporary solution, we are moving to "on-demand" instances
(which basically means that machines won't go up and down often), so I hope
it will solve our problems.

Thanks
-- 
Nitay Kufert
Backend Team Leader
