In your producer application, you could write logic so that key1 and key2 are produced to separate topics, each with an appropriate number of partitions for that key according to your throughput requirements. On the consumer side, you can configure the application to consume from multiple topics, or dedicate instances to a specific topic. This way you isolate the skew into a dedicated topic for certain key(s). For example, you could have two or three tiers of topics for your messages - a high-volume topic, a medium-volume topic, and a low-volume topic - and your producer application produces to the respective tier based on the key: if key = 1, then topic1; if key = 10, then topic3. This approach can also potentially reduce the number of partitions you need per topic.
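The tiered-topic routing described above can be sketched as a pure lookup, independent of any Kafka client code. This is a minimal sketch; the topic names (events-high, events-medium, events-low) and the sets of keys per tier are hypothetical placeholders - in practice the tier assignments would come from your observed per-key throughput.

```java
import java.util.Set;

public class TopicTierRouter {
    // Hypothetical tier assignments; in a real system these would be
    // derived from measured per-key message rates.
    private static final Set<String> HIGH_VOLUME_KEYS = Set.of("key1", "key2");
    private static final Set<String> MEDIUM_VOLUME_KEYS = Set.of("key10", "key11");

    // Route a record key to one of three tiered topics.
    public static String topicFor(String key) {
        if (HIGH_VOLUME_KEYS.contains(key)) {
            return "events-high";
        }
        if (MEDIUM_VOLUME_KEYS.contains(key)) {
            return "events-medium";
        }
        return "events-low";
    }

    public static void main(String[] args) {
        System.out.println(topicFor("key1"));   // events-high
        System.out.println(topicFor("key500")); // events-low
    }
}
```

The producer would then call topicFor(key) when building each ProducerRecord, so the skewed keys land in a topic sized for their volume while the long tail shares a smaller topic.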
5 messages per second is quite slow for processing. Think of ways to optimize - think batching. A large number of partitions does not come without its downsides - here is a link for your reference: <https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/>

On Thu, Aug 22, 2024 at 8:04 AM Karthick <ibmkarthickma...@gmail.com> wrote:

> Thanks Akash Jain for your detailed explanation. I have answered your
> queries.
>
> Yes, as you said, key1 and key2 are doing more produces, so the skew is
> happening.
>
> > In which case you will have to get creative with your topic design
>
> Can you please guide on how to do dynamic segregation of such keys into
> separate topics, or is there any design available regarding this? Please
> tell.
>
> We need to process 500 messages per second on the consumer side, but the
> processing layer is doing ~5 messages per second, so we chose 96 (which
> is close to 100).
>
> On Wed, Aug 21, 2024 at 5:37 PM Akash Jain <akashjain0...@gmail.com>
> wrote:
>
> > Hi Karthick,
> >
> > The choice has to be yours depending on what you want to achieve. I
> > understand you want to achieve an even distribution of messages across
> > your partitions. This depends on the following factors:
> >
> > - The frequency of keys
> > - The hashing logic itself
> >
> > What you can control is the hashing logic - one of the ways could be
> > hardcoding the keys and corresponding partition numbers in your logic
> > (this is assuming that you have a small pool of distinct keys). This
> > will definitively ensure that your algorithm is not 'biased' when
> > returning the partition number. For example:
> >
> > key1 : partition 0
> > key2 : partition 1
> > key3 : partition 2
> > key4 : partition 3
> > key5 : partition 4
> > key6 : partition 0
> > .
> > .
> > .
> >
> > However, if your data contains a high number of specific keys, skewness
> > cannot be entirely avoided.
> > For example: if key1 and key2 are produced most of the time, then you
> > will observe partitions 0 and 1 being loaded more than the other
> > partitions.
> >
> > You need to identify the reason for the skew. Is it the hashing
> > algorithm, or the frequency of the keys itself, that is causing it? If
> > it is the frequency of keys, then there is not much that can be done
> > with just one topic alone. In that case you will have to get creative
> > with your topic design - for example, you can have separate topics for
> > certain high-frequency keys!
> >
> > Moreover, you should first assess why you have 96 partitions. In my
> > experience that is way too high.
> >
> > Thanks
> >
> > On Tue, Aug 20, 2024 at 4:36 PM Karthick <ibmkarthickma...@gmail.com>
> > wrote:
> >
> > > Hi Akash Jain,
> > > Thanks for the reply. I am seeking help with choosing a hashing
> > > logic - please refer/suggest any.
> > >
> > > On Sat, Aug 17, 2024 at 10:21 AM Akash Jain <akashjain0...@gmail.com>
> > > wrote:
> > >
> > > > Hi Karthick. You could implement your own custom partitioner.
> > > >
> > > > On Saturday, August 17, 2024, Karthick <ibmkarthickma...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Team,
> > > > >
> > > > > I'm using Kafka partitioning to maintain field-based ordering
> > > > > across partitions, but I'm experiencing data skewness among the
> > > > > partitions. I have 96 partitions, and I'm sending data with 500
> > > > > distinct keys that are used for partitioning. While monitoring
> > > > > the Kafka cluster, I noticed that a few partitions are
> > > > > underutilized while others are overutilized.
> > > > >
> > > > > This seems to be a hashing problem. Can anyone suggest a better
> > > > > hashing technique or partitioning strategy to balance the load
> > > > > more effectively?
> > > > >
> > > > > Thanks in advance for your help.
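The hardcoded key-to-partition mapping suggested in the quoted thread can be sketched without any Kafka dependency as a plain lookup table. This is a minimal sketch under the thread's numbers (96 partitions, 500 distinct keys named key1..key500 - the key naming is an assumption); a real implementation would wrap this in a class implementing the Kafka Partitioner interface and register it via the partitioner.class producer config.

```java
import java.util.HashMap;
import java.util.Map;

public class FixedKeyPartitioner {
    private static final int NUM_PARTITIONS = 96; // from the thread

    // Explicit round-robin assignment of keys to partitions; with 500
    // distinct keys and 96 partitions, keys wrap around, so each partition
    // gets 5 or 6 keys. Key names "key1".."key500" are assumed.
    private static final Map<String, Integer> KEY_TO_PARTITION = new HashMap<>();
    static {
        for (int i = 1; i <= 500; i++) {
            KEY_TO_PARTITION.put("key" + i, (i - 1) % NUM_PARTITIONS);
        }
    }

    // Return the fixed partition for a known key; fall back to a hash for
    // unknown keys so every record still gets a valid partition.
    public static int partitionFor(String key) {
        Integer p = KEY_TO_PARTITION.get(key);
        if (p != null) {
            return p;
        }
        return Math.floorMod(key.hashCode(), NUM_PARTITIONS);
    }
}
```

Note that a fixed table like this only removes bias in the key-to-partition assignment; as the thread points out, if key1 and key2 dominate the traffic, their partitions will still be hot, which is why the tiered-topic design is the stronger fix for frequency skew.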