[ https://issues.apache.org/jira/browse/KAFKA-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613028#comment-14613028 ]
Gianmarco De Francisci Morales commented on KAFKA-2092: ------------------------------------------------------- [~hachikuji] thanks for the feedback. If I understand correctly what you are saying, the problem is the 1:1 mapping of a partition to a consumer. That is, if you want to aggregate the state of the key in a single place, you need some extra work. I don't think this is a problem. Referring to the paper (1), we have a DAG with 3 layers: source -> worker -> aggregator The source does the ingestion to Kafka, and uses PKG to partition the topic from source to worker (let's call it S-W topic). The worker does partial aggregation of the key, one per partition. This still complies with the 1:1 mapping of partition to consumer that we have now in Kafka. The result is that the state derived from a single key is now split across 2 workers (because it was computed independently from 2 partitions). The worker then uses Key Grouping to send data to the aggregator on the W-A topic. This topic works again as a "normal" topic in Kafka, with 1-key-to-1-partition guarantee. The aggregator can do the final job of putting together the two partial states (which is a constant-time operation). What is missing from this scenario is the case when we want to keep the state disjoint until query time. That is, the state for key k is in partitions p1 and p2, identified as p1=h1(k) and p2=h2(k) (so the mapping is deterministic). Note that in this case the consumer task does *not* want to consume the whole partition, rather just the key k. There might be many other keys in p1 and p2 that the consumer task is not interested in. This mode of operation can be supported easily out of Kafka. To further elaborate on your point, I don't see a use case for producing a topic via PKG and then consuming it via KG, as it defeats the purpose of load balancing. To achieve any gain, we need to process the 2 partial sub-streams of a key independently, and to do so you need some processing layer in between. Otherwise it would be the same as using KG in the first place (all the messages with the same key are processed by the same consumer). This intermediate processing layer can repartition the stream and produce a topic partitioned via KG. Hope it's not too obscure :) > New partitioning for better load balancing > ------------------------------------------ > > Key: KAFKA-2092 > URL: https://issues.apache.org/jira/browse/KAFKA-2092 > Project: Kafka > Issue Type: Improvement > Components: producer > Reporter: Gianmarco De Francisci Morales > Assignee: Jun Rao > Attachments: KAFKA-2092-v1.patch > > > We have recently studied the problem of load balancing in distributed stream > processing systems such as Samza [1]. > In particular, we focused on what happens when the key distribution of the > stream is skewed when using key grouping. > We developed a new stream partitioning scheme (which we call Partial Key > Grouping). It achieves better load balancing than hashing while being more > scalable than round robin in terms of memory. > In the paper we show a number of mining algorithms that are easy to implement > with partial key grouping, and whose performance can benefit from it. We > think that it might also be useful for a larger class of algorithms. > PKG has already been integrated in Storm [2], and I would like to be able to > use it in Samza as well. As far as I understand, Kafka producers are the ones > that decide how to partition the stream (or Kafka topic). > I do not have experience with Kafka, however partial key grouping is very > easy to implement: it requires just a few lines of code in Java when > implemented as a custom grouping in Storm [3]. > I believe it should be very easy to integrate. > For all these reasons, I believe it will be a nice addition to Kafka/Samza. > If the community thinks it's a good idea, I will be happy to offer support in > the porting. > References: > [1] > https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf > [2] https://issues.apache.org/jira/browse/STORM-632 > [3] https://github.com/gdfm/partial-key-grouping -- This message was sent by Atlassian JIRA (v6.3.4#6332)