Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

Dear Spark Community,
I recently reached out to the Apache Flink community for assistance with a critical issue we are facing in our IoT platform, which relies on Apache Kafka and real-time data processing. We received valuable insights and suggestions there, and we would now like to seek your expertise and guidance on the same problem.

In our IoT ecosystem we deal with data streams from thousands of devices, each uniquely identified. To maintain data integrity and ordering, we have configured a Kafka topic with ten partitions, so that each device's data is directed to a specific partition based on its unique identifier. While this architectural choice has been effective in maintaining data order, it has exposed a significant challenge:

*Slow Consumer and Data Skew Problem:* When a single device experiences processing delays, it becomes a bottleneck within its Kafka partition, delaying the processing of data from other devices that share the same partition. This severely affects the efficiency and scalability of our entire data processing pipeline.

Here are some key details:
- Number of devices: 1000 (with potential growth)
- Target message rate: 1000 messages per second (with expected growth)
- Kafka partitions: 10 (some partitions are overloaded)
- We are planning to migrate from Apache Storm to Apache Flink/Spark.

We are actively seeking guidance on the following aspects (I have included a few rough sketches after my signature for context):

*1. Independent Device Data Processing:* We need a strategy that guarantees one device's processing speed does not affect other devices in the same Kafka partition; in other words, a solution that ensures each device's data is processed independently.

*2. Custom Partitioning Strategy:* We are looking for a custom partitioning strategy that distributes the load evenly across Kafka partitions. Currently we use Murmur hashing on the device's unique identifier, but we are open to alternative partitioning strategies.

*3. Determining Kafka Partition Count:* We would like guidance on how to determine the optimal number of Kafka partitions to handle the target message rate efficiently.

*4. Handling Data Skew:* Strategies or techniques for handling data skew within Apache Flink/Spark.

We believe many in your community may have faced similar challenges or have valuable insights into addressing them. Your expertise and experience would greatly benefit our team and the broader community working on real-time data processing. If you know of solutions, open-source projects, libraries, or community-contributed work that align with these requirements, we would be immensely grateful for your input.

We appreciate your prompt attention to this matter and look forward to your responses and insights; your support will be invaluable in helping us overcome this challenge. Thank you for your time and consideration.

Thanks & regards,
Karthick.
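
P.S. For context, here are a few rough, simplified sketches of what we have and what we are considering. None of this is our production code; class names, topic names, broker addresses, and helpers such as extractDeviceId and process are placeholders.

On point 1, this is the kind of per-device keyed processing we have in mind after the Kafka source, assuming Flink's KafkaSource connector: keying by device id would give each device its own keyed state, decoupled from the Kafka partition it arrived on.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Placeholder pipeline: consume the topic, then re-key by device id so that
// per-device state and processing are no longer tied to the Kafka partition layout.
public class PerDevicePipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")            // placeholder brokers
                .setTopics("iot-events")                       // placeholder topic
                .setGroupId("iot-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Group by device id: each device becomes its own keyed sub-stream
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String event) {
                        return extractDeviceId(event);
                    }
                })
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String event) {
                        return process(event);                 // per-device processing step
                    }
                })
                .print();

        env.execute("per-device-processing-sketch");
    }

    // Placeholder: parse the device id out of the raw event payload
    private static String extractDeviceId(String event) {
        return event.split(",")[0];
    }

    // Placeholder: whatever per-device enrichment/validation we run today
    private static String process(String event) {
        return event;
    }
}
```

Our open question here is whether this really isolates a slow device, since, as we understand it, several devices can still be hashed to the same parallel subtask.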
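
On point 2, this is roughly the shape of the Murmur-based partitioner we use today (simplified; DeviceIdPartitioner is an illustrative name, and it assumes the record key is the device id). It pins every device to a fixed partition, which preserves per-device ordering but also produces the overloaded partitions we described. It is registered on the producer via partitioner.class.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Illustrative sketch: routes every record for a given device id to a fixed partition
// via Murmur2, which keeps per-device ordering but can overload hot partitions.
public class DeviceIdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        if (keyBytes == null) {
            // without a key we cannot pin the device; fall back to partition 0 in this sketch
            return 0;
        }
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // toPositive keeps the hash non-negative before the modulo
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```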
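
On point 3, the only sizing approach we know of is the rule of thumb of measuring per-partition producer and consumer throughput and taking partitions >= max(target/producer rate, target/consumer rate). The sketch below just encodes that arithmetic with made-up per-partition rates (not benchmarks from our cluster); we would welcome corrections if there is a better method.

```java
// Rule-of-thumb sizing: partitions >= max(T/p, T/c), where T is the target throughput,
// p the measured per-partition producer throughput, c the per-partition consumer throughput.
public final class PartitionCountEstimate {

    static int estimatePartitions(double targetMsgPerSec,
                                  double producerMsgPerSecPerPartition,
                                  double consumerMsgPerSecPerPartition) {
        double byProducer = targetMsgPerSec / producerMsgPerSecPerPartition;
        double byConsumer = targetMsgPerSec / consumerMsgPerSecPerPartition;
        return (int) Math.ceil(Math.max(byProducer, byConsumer));
    }

    public static void main(String[] args) {
        // Made-up rates: 500 msg/s produced and 200 msg/s consumed per partition -> 5 partitions
        System.out.println(estimatePartitions(1000, 500, 200));
    }
}
```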
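
On point 4, one option we are weighing for hot devices is key salting: spreading a hot device over a few sub-keys so its records map to several partitions, at the cost of strict per-device ordering. The helper below is only a sketch of that idea (the salt count and key format are arbitrary), and we would like to hear whether people have had success with it or prefer other skew-handling techniques.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of key salting: a hot device id "sensor-42" becomes "sensor-42#0" .. "sensor-42#3",
// so its records hash to up to 4 partitions instead of 1. Consumers must then either
// tolerate out-of-order data per device or re-key on the unsalted id downstream.
public final class SaltedKeySketch {

    private static final int SALT_BUCKETS = 4; // arbitrary; would be tuned per hot device

    static String saltedKey(String deviceId) {
        int salt = ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
        return deviceId + "#" + salt;
    }

    static String unsaltedKey(String saltedKey) {
        int idx = saltedKey.lastIndexOf('#');
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("sensor-42"));      // e.g. sensor-42#2
        System.out.println(unsaltedKey("sensor-42#2"));  // sensor-42
    }
}
```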