It's better to derive the partition key from some function f(user) rather than creating a partition per user. That way a partition always holds the same set of users, and any new user is automatically assigned to one of the existing partitions.
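A minimal sketch of such an f(user): hash the user id and take it modulo the partition count, so the mapping is deterministic and new users fall into existing partitions. (This mirrors what Kafka's default partitioner already does when you key messages by user id; the function name and the md5 choice here are illustrative, not Kafka's actual murmur2 implementation.)

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Deterministically map a user id to a partition.

    The same user always lands in the same partition, and a brand-new
    user id is simply absorbed by one of the existing partitions --
    no topic changes needed when users sign up.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

In practice you would not call this yourself: producing with `user_id` as the message key gives equivalent behavior via Kafka's built-in key-based partitioning.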
You can probably check https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html for Kafka-to-Spark integration.

On Sun, Dec 22, 2019 at 7:54 PM Girish Vasmatkar <girish.vasmat...@hotwaxsystems.com> wrote:

> Hi All
>
> I have recently subscribed and am fairly new to Kafka, so please pardon me
> if the question sounds too naive!
>
> I'm trying to build a POC on clickstream analysis for logged-in and
> anonymous users for our e-commerce application. I am coming here after
> visiting this thread:
>
> https://stackoverflow.com/questions/32761598/is-it-possible-to-create-a-kafka-topic-with-dynamic-partition-count
>
> I have the following questions in this regard:
>
> 1. I am planning to have as many partitions as the number of users we
>    have in the system, so that clickstream events for individual users go
>    into the dedicated partition for that user. The dedicated partition id
>    is to be derived from the user id. Does this sound like a decent
>    approach? If not, what is the suggested way to go about this?
> 2. If a partition-per-user strategy is good enough, then what happens
>    when a new user signs up and will obviously have a new and unique user
>    id? I am not sure if we can add a new partition to an existing topic?
> 3. This Kafka stream is going to be consumed by a Spark Streaming job
>    (Kafka consumer). How do I set it up so that it gets clickstream events
>    from the Kafka topic for all users (irrespective of the partition id)?
>    In other words, can we have a one-for-all consumer for a topic?
>
> Best,
> Girish
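On question 3: yes, a single consumer subscribed to a topic receives records from all of its partitions, so no per-partition wiring is needed. A rough sketch, assuming Spark Structured Streaming's Kafka source (the broker address and the topic name `clickstream` are placeholders, not values from the thread):

```python
def spark_kafka_options(bootstrap_servers: str, topic: str) -> dict:
    """Build option map for Spark Structured Streaming's Kafka source.

    Subscribing by topic name (rather than assigning individual
    partitions) makes the job read every partition of the topic,
    so it sees clickstream events for all users.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # placeholder broker list
        "subscribe": topic,           # topic-level subscription covers all partitions
        "startingOffsets": "latest",  # or "earliest" to replay existing events
    }

# These options would be fed to the reader roughly like:
#   spark.readStream.format("kafka") \
#        .options(**spark_kafka_options("broker:9092", "clickstream")) \
#        .load()
```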