Re: Uneven distribution of messages in topic's partitions

Ricardo Ferreira Fri, 19 Jun 2020 11:06:19 -0700

Hi Hemant,

Looking up specific records by key is not possible in Kafka. As a distributed streaming platform based on the concept of a commit log, Kafka organizes data sequentially, where each record has an offset that uniquely identifies not what the record is but where within the log it is positioned.

In order to implement record lookup by key you would need to use Kafka Streams or ksqlDB. I would recommend ksqlDB since you can easily create a stream out of your existing topic and then transform that stream into a table. Note only that ksqlDB currently requires each table that serves pull queries (i.e., queries that look up rows for a given key) to be created using an aggregation construct. So you might need to work that out in order to achieve the behavior that you want.
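
To make the idea concrete, here is a minimal sketch in plain Java of what "turn the stream into a table through an aggregation" means. It is a toy model only: an in-memory list stands in for the topic, a map stands in for the materialized table, and the names `KeyLookupSketch` and `materialize` are made up for illustration. It does not use any Kafka Streams or ksqlDB API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: the list index plays the role of the offset, and the map
// plays the role of the table that an aggregation would materialize.
class KeyLookupSketch {

    // Fold the log into a table: key -> every value seen for that key, in log order.
    static Map<String, List<String>> materialize(List<Map.Entry<String, String>> log) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> rec : log) {
            table.computeIfAbsent(rec.getKey(), k -> new ArrayList<>()).add(rec.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> log = new ArrayList<>();
        log.add(Map.entry("xyz", "A")); // offset 0
        log.add(Map.entry("abc", "A")); // offset 1 ("abc" is a hypothetical second item)
        log.add(Map.entry("xyz", "B")); // offset 2

        // The log itself can only be addressed by position.
        System.out.println(log.get(2).getValue());       // prints B

        // The aggregated table is what answers "all states for key xyz".
        System.out.println(materialize(log).get("xyz")); // prints [A, B]
    }
}
```

The lookup by key works only because the aggregation step has already grouped the records; the log alone would have to be scanned from the beginning for every request.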


Thanks,

-- Ricardo

On 6/19/20 1:07 PM, Hemant Bairwa wrote:

Thanks Ricardo.

I need some information on one more use case.

In my application I need to use Kafka to maintain the different workflow states of message items as they pass through different processes. For example, in my application all messages transit from Process A to Process Z, and I need to keep every processed state for each item. So for item xyz there should be a total of 26 entries in the Kafka topic.

xyz, A
xyz, B... and so on.

Users should be able to retrieve all the messages for any specific key as many times as needed. That is, a database-like feature is required.


1. Is Kafka alone able to cater to this requirement?

2. Or do I need to use ksqlDB to meet this requirement? I did some research around it, but I don't want to run a separate ksqlDB server.

3. Any other suggestions?

Regards,

On Thu, 18 Jun 2020, 6:51 pm Ricardo Ferreira <rifer...@riferrei.com> wrote:


    Hemant,

    This behavior might be the result of the version of AK (Apache
    Kafka) that you are using. Before AK 2.4, the default behavior of
    the DefaultPartitioner was to load balance data production across
    the partitions as you described. But it was found that this
    behavior causes performance problems for the batching strategy
    that each producer uses. Therefore, AK 2.4 introduced a new
    behavior into the DefaultPartitioner called sticky partitioning.
    You can read more about this change in the KIP that was created
    for it: KIP-480
    <https://cwiki.apache.org/confluence/display/KAFKA/KIP-480%3A+Sticky+Partitioner>.

    The only downside that I see in your workaround is that you are
    handling partition selection programmatically. That would make
    your code fragile, because if the number of partitions for the
    topic changes then your code would not know this. Instead, just
    use the RoundRobinPartitioner
    <https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/producer/RoundRobinPartitioner.html>
    explicitly in your producer:

    ```
    configs.put("partitioner.class",
        "org.apache.kafka.clients.producer.RoundRobinPartitioner");
    ```
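
For context, a slightly fuller sketch of where that setting might sit among the other producer settings. It uses only `java.util.Properties`, so it runs without the Kafka client library. The broker address and the choice of `StringSerializer` for keys and values are assumptions, not details from this thread, and the class and method names are made up for illustration.

```java
import java.util.Properties;

class RoundRobinProducerConfig {

    // Builds producer settings as plain key/value pairs.
    // The broker address is supplied by the caller.
    static Properties build(String bootstrapServers) {
        Properties configs = new Properties();
        configs.put("bootstrap.servers", bootstrapServers);
        // Assumption: both keys and values are plain strings.
        configs.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        configs.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Replaces the default partitioner so that records rotate across partitions.
        configs.put("partitioner.class",
                "org.apache.kafka.clients.producer.RoundRobinPartitioner");
        return configs;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder address.
        Properties configs = build("localhost:9092");
        configs.forEach((k, v) -> System.out.println(k + "=" + v));
        // With the Kafka client library on the classpath, these settings would be
        // passed to: new KafkaProducer<String, String>(configs)
    }
}
```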

    Thanks,

    -- Ricardo

    On 6/18/20 12:38 AM, Hemant Bairwa wrote:

    Hello All

    I have a single producer service which queues messages into a topic with,
    let's say, 12 partitions. I want to distribute the messages evenly across
    all the partitions in a round-robin fashion.
    Even after using default partitioning and keeping the key 'NULL', the
    messages are not getting distributed evenly. Rather, some partitions are
    getting none of the messages while others are getting multiple.
    One reason I found for this behaviour, somewhere, is that if there are
    fewer producers than partitions, the messages are distributed to fewer
    partitions to limit the number of open sockets.
    However, I have achieved even distribution through code by first getting
    the total number of partitions and then passing the partition number, in
    incremental order, along with the message into the producer record. Once
    the partition number reaches the last partition, the next partition number
    is reset to zero.

    Query:
    1. Can there be any downside to the approach described above?
    2. If yes, how can I achieve even distribution of messages in an optimized way?
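
The two behaviours discussed in this thread can be compared with a small simulation in plain Java. This is a rough sketch, not Kafka's implementation: `roundRobin` mirrors the incrementing-counter approach from the original question, and `sticky` is a deliberately simplified stand-in for sticky partitioning that stays on one partition for a fixed number of records and then picks another at random. The class name, method names, batch size and seed are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Random;

class PartitionSpreadSketch {

    // Round robin: record i goes to partition i mod partitionCount.
    // The partition count is fixed here; if the topic grew, this code would not notice.
    static int[] roundRobin(int records, int partitionCount) {
        int[] counts = new int[partitionCount];
        for (int i = 0; i < records; i++) {
            counts[i % partitionCount]++;
        }
        return counts;
    }

    // Simplified "sticky" behaviour: keep the same partition until batchSize
    // records have been sent, then choose a partition at random again.
    static int[] sticky(int records, int partitionCount, int batchSize, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[partitionCount];
        int current = random.nextInt(partitionCount);
        for (int i = 0; i < records; i++) {
            if (i > 0 && i % batchSize == 0) {
                current = random.nextInt(partitionCount);
            }
            counts[current]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 120 records over 12 partitions: round robin gives 10 per partition.
        System.out.println(Arrays.toString(roundRobin(120, 12)));
        // With batches of 40, the sticky sketch touches at most 3 partitions.
        System.out.println(Arrays.toString(sticky(120, 12, 40, 42L)));
    }
}
```

With a modest number of records, the sticky sketch leaves most partitions empty, which matches the symptom described in the question, while the round-robin counter spreads them evenly at the cost of smaller batches per partition.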
