Hello Jan,

One alternative approach you can consider is to use the combination <team, user> as
the key, which keeps aggregations small, while customizing the
partitioner for the repartition topic so that keys with the same <team>
prefix always go to the same partition. Then, when cleaning up data,
you can similarly do a range scan on the <team> prefix within the store and
delete all <team, user> entries once the team is removed.
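To make the two ideas concrete, here is a rough sketch (not the exact Kafka Streams API — in a real app the partition logic would live in a StreamPartitioner passed to the repartition step, and the cleanup would use KeyValueStore#range on the serialized prefix). TeamUserKey and the TreeMap "store" below are hypothetical stand-ins to keep the example self-contained:

```java
import java.util.TreeMap;

// Sketch of (1) partitioning on the team prefix of a composite <team, user>
// key, and (2) evicting all of a team's entries with a prefix range scan.
public class TeamKeying {

    // Composite key: team id first, so serialized keys sort by team.
    record TeamUserKey(String teamId, String userId) {
        String serialized() { return teamId + "|" + userId; }
    }

    // Hash only the team prefix: every <team, user> key of the same team
    // goes to the same partition, keeping a team's data co-partitioned.
    static int partitionFor(TeamUserKey key, int numPartitions) {
        return (key.teamId().hashCode() & 0x7fffffff) % numPartitions;
    }

    // Remove all entries with the given team prefix, mimicking a range
    // scan over the state store when the team is torn down.
    static void evictTeam(TreeMap<String, Long> store, String teamId) {
        store.subMap(teamId + "|", teamId + "|\uffff").clear();
    }

    public static void main(String[] args) {
        TreeMap<String, Long> store = new TreeMap<>();
        store.put(new TeamUserKey("team-1", "alice").serialized(), 3L);
        store.put(new TeamUserKey("team-1", "bob").serialized(), 5L);
        store.put(new TeamUserKey("team-2", "carol").serialized(), 7L);

        // Same team, different users => same partition.
        int p1 = partitionFor(new TeamUserKey("team-1", "alice"), 8);
        int p2 = partitionFor(new TeamUserKey("team-1", "bob"), 8);
        System.out.println(p1 == p2); // prints true

        evictTeam(store, "team-1");
        System.out.println(store.keySet()); // only team-2's entry remains
    }
}
```

The important detail is that the key serde must put the team id first, so that a team's entries are contiguous in byte order and a single range scan covers them.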

Guozhang




On Mon, Oct 26, 2020 at 1:39 PM Jan Bols <janb...@telenet.be> wrote:

> For a kafka-streams application, we keep data per team. Data from 2 teams
> never meet but within a team, data is highly integrated. A team has team
> members but also has several types of equipment.
> A team has a lifespan of about 1-3 days after which the team is removed and
> all data relating to that team should be evicted.
>
> How would you partition the data?
> - Using the team id as key for all streams seems not ideal b/c this means
> all aggregations need to happen per team, involving a ser/deser of the
> entire team's data. Suppose there are 10 team members and only 1 team member
> is sending events that need to be aggregated. In this case, we still need a
> ser/deser of the entire aggregated team data. I'm afraid this would result
> in quite a bit of overhead.
> - Using the user id or equipment id as key would result in much smaller
> aggregations but does mean quite a bit of repartitioning when aggregating
> and joining users of the same team.
>
> I ended up using the second approach, but I wonder if that was really a
> good idea b/c the entire streaming logic does become quite involved.
>
> What is your experience with this type of data?
>
> Best regards
> Jan
>


-- 
-- Guozhang