Hi All, it would be helpful if anyone could give us some pointers on the
problem described below.
Thanks,
Karthick.

On Wed, Sep 20, 2023 at 3:03 PM Gowtham S <gowtham.co....@gmail.com> wrote:

> Hi Spark Community,
>
> Thank you for bringing up this issue. We've also encountered the same
> challenge and are actively working on finding a solution. It's reassuring
> to know that we're not alone in this.
>
> If you have any insights or suggestions regarding how to address this
> problem, please feel free to share them.
>
> Looking forward to hearing from others who might have encountered similar
> issues.
>
> Thanks and regards,
> Gowtham S
>
> On Tue, 19 Sept 2023 at 17:23, Karthick <ibmkarthickma...@gmail.com> wrote:
>
>> Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem
>>
>> Dear Spark Community,
>>
>> I recently reached out to the Apache Flink community for assistance with
>> a critical issue we are facing in our IoT platform, which relies on Apache
>> Kafka and real-time data processing. We received some valuable insights
>> and suggestions from the Apache Flink community, and now we would like to
>> seek your expertise and guidance on the same problem.
>>
>> In our IoT ecosystem, we are dealing with data streams from thousands of
>> devices, each uniquely identified. To maintain data integrity and
>> ordering, we have configured a Kafka topic with ten partitions, ensuring
>> that each device's data is directed to its respective partition based on
>> its unique identifier. While this architectural choice has been effective
>> in maintaining data order, it has unveiled a significant challenge:
>>
>> *Slow Consumer and Data Skew Problem:* When a single device experiences
>> processing delays, it acts as a bottleneck within the Kafka partition,
>> leading to delays in processing data from other devices sharing the same
>> partition. This issue severely affects the efficiency and scalability of
>> our entire data processing pipeline.
>>
>> Here are some key details:
>>
>> - Number of Devices: 1000 (with potential growth)
>> - Target Message Rate: 1000 messages per second (with expected growth)
>> - Kafka Partitions: 10 (some partitions are overloaded)
>> - We are planning to migrate from Apache Storm to Apache Flink/Spark.
>>
>> We are actively seeking guidance on the following aspects:
>>
>> *1. Independent Device Data Processing:* We require a strategy that
>> guarantees one device's processing speed does not affect other devices in
>> the same Kafka partition. In other words, we need a solution that ensures
>> the independent processing of each device's data.
>>
>> *2. Custom Partitioning Strategy:* We are looking for a custom
>> partitioning strategy to distribute the load evenly across Kafka
>> partitions. Currently, we are using Murmur hashing with the device's
>> unique identifier, but we are open to exploring alternative partitioning
>> strategies.
>>
>> *3. Determining Kafka Partition Count:* We seek guidance on how to
>> determine the optimal number of Kafka partitions to handle the target
>> message rate efficiently.
>>
>> *4. Handling Data Skew:* Strategies or techniques for handling data skew
>> within Apache Flink.
>>
>> We believe that many in your community may have faced similar challenges
>> or possess valuable insights into addressing them. Your expertise and
>> experiences can greatly benefit our team and the broader community
>> dealing with real-time data processing.
>>
>> If you have any knowledge, solutions, or references to open-source
>> projects, libraries, or community-contributed solutions that align with
>> our requirements, we would be immensely grateful for your input.
>>
>> We appreciate your prompt attention to this matter and eagerly await your
>> responses and insights. Your support will be invaluable in helping us
>> overcome this critical challenge.
>>
>> Thank you for your time and consideration.
>>
>> Thanks & regards,
>> Karthick.
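On point 2 (custom partitioning), one producer-side option is a custom Kafka
Partitioner that keeps most devices sticky to one partition but "salts" a
small set of known-hot devices across several partitions. The sketch below is
a minimal illustration, not a drop-in solution: the class name, the hot-device
set, and the salt-bucket count are all hypothetical, and salting a device
sacrifices strict ordering for that device, which the original requirements
may not allow.

    import java.util.{Map => JMap}
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster
    import org.apache.kafka.common.utils.Utils

    class HotDeviceSpreadingPartitioner extends Partitioner {
      // Hypothetical set of device IDs known to be high-volume; in practice
      // this would come from configuration or metrics.
      private val hotDevices = Set("device-0042", "device-0317")
      private val saltBuckets = 4
      private val counter = new AtomicLong()

      override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                             value: Any, valueBytes: Array[Byte],
                             cluster: Cluster): Int = {
        val numPartitions = cluster.partitionsForTopic(topic).size
        val deviceId = key.toString
        val effectiveKey =
          if (hotDevices.contains(deviceId))
            // Spread a hot device over saltBuckets partitions (round-robin
            // salt). Note: this gives up strict ordering for that device.
            s"$deviceId#${counter.getAndIncrement() % saltBuckets}"
          else
            deviceId // normal devices stay sticky to a single partition
        Utils.toPositive(Utils.murmur2(effectiveKey.getBytes("UTF-8"))) % numPartitions
      }

      override def configure(configs: JMap[String, _]): Unit = ()
      override def close(): Unit = ()
    }

Such a class would be registered on the producer via the partitioner.class
configuration property.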
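On point 3 (partition count), a commonly cited rule of thumb is to size by
throughput: measure what a single partition can sustain on the produce side
and on the consume side, then take the larger of target/produce-rate and
target/consume-rate, plus headroom for growth. The per-partition numbers
below are placeholders, not benchmarks from the thread.

    // Rough sizing sketch; replace the per-partition throughputs with values
    // measured on the actual cluster and consumers.
    val targetMsgPerSec      = 1000.0 // target rate from the thread
    val producedPerPartition = 500.0  // hypothetical measured producer rate per partition
    val consumedPerPartition = 200.0  // hypothetical measured consumer rate per partition

    val suggestedPartitions = math.max(
      math.ceil(targetMsgPerSec / producedPerPartition),
      math.ceil(targetMsgPerSec / consumedPerPartition)
    ).toInt
    // With these placeholder numbers: max(2, 5) = 5 partitions, before headroom.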
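On points 1 and 4, one approach on the Spark side is to decouple Spark's
parallelism from the ten Kafka partitions: the Kafka source's minPartitions
option splits Kafka partitions into more Spark input partitions, and a
repartition on the device ID redistributes work so a heavy device lands in
its own tasks rather than sharing a task with every other device in its Kafka
partition. This is a minimal Structured Streaming sketch: the broker address,
topic name, column names, partition counts, and console sink are all
placeholders, and ordering guarantees after a shuffle need separate care.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DeviceStreamJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("device-stream").getOrCreate()

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "iot-events")                // placeholder topic
          // Split each Kafka partition into several Spark input partitions.
          .option("minPartitions", "40")
          .load()

        val events = raw
          .selectExpr("CAST(key AS STRING) AS device_id",
                      "CAST(value AS STRING) AS payload")
          // Redistribute work by device so a slow or heavy device only
          // occupies its own tasks.
          .repartition(200, col("device_id"))

        val query = events.writeStream
          .format("console") // placeholder sink
          .option("checkpointLocation", "/tmp/device-stream-checkpoint")
          .start()

        query.awaitTermination()
      }
    }

For per-device stateful logic, grouping by the device ID with
mapGroupsWithState / flatMapGroupsWithState is another option worth
evaluating, since it keeps each device's state and processing isolated.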