Thanks Liu Ron for the suggestion. Could you please give any pointers/references for a custom partitioning strategy? We are currently using murmur hashing with the device's unique id. It would be helpful if you could point us to any other strategies.
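To make the question concrete, here is a rough sketch of the kind of partitioner we imagine, along the lines Giannis suggested: known hot devices get their own reserved partitions and everything else keeps the murmur hashing we use today. The class name, the hard-coded hot-device list, and the partition layout are placeholders only, not a tested design.

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sketch only: pin known "hot" devices to reserved partitions and murmur-hash
// the rest, assuming every record is keyed by its device id.
public class HotDeviceAwarePartitioner implements Partitioner {

    // Placeholder list; in practice this would come from producer configuration.
    private static final List<String> HOT_DEVICES = List.of("device-42", "device-77");

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Reserve one partition per hot device, but always leave at least
        // one partition for everything else.
        int reserved = Math.min(HOT_DEVICES.size(), numPartitions - 1);

        int hotIndex = HOT_DEVICES.indexOf(String.valueOf(key));
        if (hotIndex >= 0 && hotIndex < reserved) {
            return hotIndex;  // hot device gets its own partition
        }
        // All other devices share the remaining partitions via murmur hashing,
        // roughly what the default keyed partitioning already does.
        return reserved + Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - reserved);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

If this direction is sensible, it would presumably be registered through the producer's partitioner.class property; any pointers on pitfalls with this kind of approach would be appreciated.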
Thanks and regards
Karthick.

On Mon, Sep 18, 2023 at 9:18 AM liu ron <ron9....@gmail.com> wrote:

> Hi, Karthick
>
> It looks like a data skewing problem, and I think one of the easiest and most efficient ways to address it is to increase the number of partitions and see how it works first, like try expanding by 100 first.
>
> Best,
> Ron
>
> Karthick <ibmkarthickma...@gmail.com> wrote on Sun, Sep 17, 2023 at 17:03:
>
>> Thanks Wei Chen, Giannis for the time.
>>
>>> For starters, you need to better size and estimate the required number of partitions you will need on the Kafka side in order to process 1000+ messages/second. The number of partitions should also define the maximum parallelism for the Flink job reading from Kafka.
>>
>> Thanks for the pointer. Can you please guide us on what factors we need to consider regarding this?
>>
>>> use a custom partitioner that spreads those devices to somewhat separate partitions.
>>
>> Please suggest a working solution regarding the custom partitioner to distribute the load. It would be helpful.
>>
>>> What we were doing at that time was to define multiple topics and each has a different # of partitions
>>
>> Thanks for the suggestion. Is there any calculation for choosing the topic count, or are there any formulae/factors to determine this number? Please let us know if available; it would help us choose.
>>
>> Thanks and Regards
>> Karthick.
>>
>> On Sun, Sep 17, 2023 at 4:04 AM Wei Chen <sagitc...@qq.com> wrote:
>>
>>> Hi Karthick,
>>> We've experienced a similar issue before. What we were doing at that time was to define multiple topics, each with a different # of partitions, which means the topics with more partitions will have higher parallelism for processing.
>>> You can further divide the topics into several groups, where each group has a similar # of partitions. Each group can then be defined as the source of a Flink data stream, so the groups run in parallel with different parallelism.
>>>
>>> ------------------ Original ------------------
>>> *From:* Giannis Polyzos <ipolyzos...@gmail.com>
>>> *Date:* Sat, Sep 16, 2023 11:52 PM
>>> *To:* Karthick <ibmkarthickma...@gmail.com>
>>> *Cc:* Gowtham S <gowtham.co....@gmail.com>, user <user@flink.apache.org>
>>> *Subject:* Re: Urgent: Mitigating Slow Consumer Impact and Seeking Open-Source Solutions in Apache Kafka Consumers
>>>
>>> Can you provide some more context on what your Flink job will be doing? There might be some things you can do to fix the data skew on the Flink side, but first, you want to start with Kafka.
>>> For starters, you need to better size and estimate the required number of partitions you will need on the Kafka side in order to process 1000+ messages/second.
>>> The number of partitions should also define the maximum parallelism for the Flink job reading from Kafka.
>>> If you know your "hot devices" in advance you might wanna use a custom partitioner that spreads those devices to somewhat separate partitions.
>>> Overall this is somewhat of a trial-and-error process. You might also want to check that these partitions are evenly balanced among your brokers and don't cause too much stress on particular brokers.
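To make the multi-topic grouping Wei Chen describes more concrete, here is a minimal sketch using Flink's KafkaSource, where each topic group becomes its own source with its own parallelism. The broker address, topic names, group id, and parallelism figures are illustrative only.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiTopicJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topics: "devices-hot" has many partitions, "devices-normal" fewer.
        KafkaSource<String> hotSource = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("devices-hot")
                .setGroupId("iot-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSource<String> normalSource = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("devices-normal")
                .setGroupId("iot-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Each group is consumed as its own source with its own parallelism.
        DataStream<String> hot = env
                .fromSource(hotSource, WatermarkStrategy.noWatermarks(), "hot-devices")
                .setParallelism(40);   // e.g. a topic with 40 partitions
        DataStream<String> normal = env
                .fromSource(normalSource, WatermarkStrategy.noWatermarks(), "normal-devices")
                .setParallelism(10);   // e.g. a topic with 10 partitions

        hot.union(normal)
           // downstream processing would go here
           .print();

        env.execute("multi-topic-iot");
    }
}

The parallelism given to each source would normally stay at or below that topic's partition count, in line with the point above that partitions cap the maximum source parallelism.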
>>> Best
>>>
>>> On Sat, Sep 16, 2023 at 6:03 PM Karthick <ibmkarthickma...@gmail.com> wrote:
>>>
>>>> Hi Gowtham i agree with you,
>>>>
>>>> I'm eager to resolve the issue or gain a better understanding. Your assistance would be greatly appreciated.
>>>>
>>>> If there are any additional details or context needed to address my query effectively, please let me know, and I'll be happy to provide them.
>>>>
>>>> Thank you in advance for your time and consideration. I look forward to hearing from you and benefiting from your expertise.
>>>>
>>>> Thanks and Regards
>>>> Karthick.
>>>>
>>>> On Sat, Sep 16, 2023 at 11:04 AM Gowtham S <gowtham.co....@gmail.com> wrote:
>>>>
>>>>> Hi Karthik
>>>>>
>>>>> This appears to be a common challenge related to a slow-consuming situation. Those with relevant experience in addressing such matters should be capable of providing assistance.
>>>>>
>>>>> Thanks and regards,
>>>>> Gowtham S
>>>>>
>>>>> On Fri, 15 Sept 2023 at 23:06, Giannis Polyzos <ipolyzos...@gmail.com> wrote:
>>>>>
>>>>>> Hi Karthick,
>>>>>>
>>>>>> on a high level seems like a data skew issue and some partitions have way more data than others?
>>>>>> What is the number of your devices? how many messages are you processing?
>>>>>> Most of the things you share above sound like you are looking for suggestions around load distribution for Kafka, i.e. number of partitions, how to distribute your device data etc.
>>>>>> It would be good to also share what your flink job is doing as I don't see anything mentioned around that. Are you observing back pressure in the Flink UI?
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> On Fri, Sep 15, 2023 at 3:46 PM Karthick <ibmkarthickma...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear Apache Flink Community,
>>>>>>>
>>>>>>> I am writing to urgently address a critical challenge we've encountered in our IoT platform that relies on Apache Kafka and real-time data processing. We believe this issue is of paramount importance and may have broad implications for the community.
>>>>>>>
>>>>>>> In our IoT ecosystem, we receive data streams from numerous devices, each uniquely identified. To maintain data integrity and ordering, we've meticulously configured a Kafka topic with ten partitions, ensuring that each device's data is directed to its respective partition based on its unique identifier. This architectural choice has proven effective in maintaining data order, but it has also unveiled a significant problem:
>>>>>>>
>>>>>>> *One device's data processing slowness is interfering with other devices' data, causing a detrimental ripple effect throughout our system.*
>>>>>>>
>>>>>>> To put it simply, when a single device experiences processing delays, it acts as a bottleneck within the Kafka partition, leading to delays in processing data from other devices sharing the same partition. This issue undermines the efficiency and scalability of our entire data processing pipeline.
>>>>>>>
>>>>>>> Additionally, I would like to highlight that we are currently using the default partitioner for choosing the partition of each device's data. If there are alternative partitioning strategies that can help alleviate this problem, we are eager to explore them.
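For context on why a single slow device stalls its partition neighbours: with the default partitioner, a keyed record is placed by hashing its key (murmur2) modulo the partition count, so every device is pinned to exactly one of the ten partitions and shares it with whichever other devices hash to the same value. A small sketch of that mapping, with made-up device ids:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class DefaultKeyedPartitionExample {
    public static void main(String[] args) {
        int numPartitions = 10;  // the topic size described above
        String[] deviceIds = {"device-001", "device-002", "device-003"};

        for (String deviceId : deviceIds) {
            byte[] keyBytes = deviceId.getBytes(StandardCharsets.UTF_8);
            // Keyed records are assigned as murmur2(key) mod partition count,
            // so every record of a device always lands in the same partition.
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.printf("%s -> partition %d%n", deviceId, partition);
        }
    }
}

Re-partitioning strategies can change which devices share a partition, but as long as per-device ordering requires keyed writes, some co-location is unavoidable once there are more devices than partitions.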
>>>>>>> We are in dire need of a highly scalable solution that not only ensures each device's data processing is independent but also prevents any interference or collisions between devices' data streams. Our primary objectives are:
>>>>>>>
>>>>>>> 1. *Isolation and Independence:* We require a strategy that guarantees one device's processing speed does not affect other devices in the same Kafka partition. In other words, we need a solution that ensures the independent processing of each device's data.
>>>>>>>
>>>>>>> 2. *Open-Source Implementation:* We are actively seeking pointers to open-source implementations or references to working solutions that address this specific challenge within the Apache ecosystem. Any existing projects, libraries, or community-contributed solutions that align with our requirements would be immensely valuable.
>>>>>>>
>>>>>>> We recognize that many Apache Flink users face similar issues and may have already found innovative ways to tackle them. We implore you to share your knowledge and experiences on this matter. Specifically, we are interested in:
>>>>>>>
>>>>>>> *- Strategies or architectural patterns that ensure independent processing of device data.*
>>>>>>> *- Insights into load balancing, scalability, and efficient data processing across Kafka partitions.*
>>>>>>> *- Any existing open-source projects or implementations that address similar challenges.*
>>>>>>>
>>>>>>> We are confident that your contributions will not only help us resolve this critical issue but also assist the broader Apache Flink community facing similar obstacles.
>>>>>>>
>>>>>>> Please respond to this thread with your expertise, solutions, or any relevant resources. Your support will be invaluable to our team and the entire Apache Flink community.
>>>>>>>
>>>>>>> Thank you for your prompt attention to this matter.
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Karthick.
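One architectural pattern sometimes used for the isolation objective above is to re-key the stream inside Flink by device id, so per-device processing is spread across Flink subtasks independently of how devices were packed into Kafka partitions. A minimal sketch, assuming string events whose first comma-separated field is the device id; the topic name, broker address, and parsing details are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerDeviceKeyedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")      // placeholder address
                .setTopics("device-events")              // hypothetical topic name
                .setGroupId("iot-pipeline")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "device-events")
           // Re-key by device id so per-device work is spread over all subtasks,
           // regardless of which Kafka partition a device happened to land in.
           // Assumes the device id is the first comma-separated field.
           .keyBy(event -> event.split(",", 2)[0])
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String event) {
                   // Stand-in for the actual per-device processing logic.
                   return event;
               }
           })
           .print();

        env.execute("per-device-keyed-iot");
    }
}

Per-device ordering is still preserved through the keyBy, but a single very slow key can still back-pressure the pipeline, so this would complement rather than replace the partitioning and sizing suggestions discussed earlier in the thread.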