Hello Jan,

One alternative you could consider is to use the composite <team, user> as the key, which keeps each aggregation small, and to customize the partitioner for the repartition topic so that keys with the same <team> prefix always go to the same partition. Then, when cleaning up data, you can similarly do a range scan on the <team> prefix within the store and delete all <team, user> entries once the team is removed.
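
To make the idea concrete, here is a minimal sketch in plain Java, with no Kafka dependency. Everything in it is an assumption for illustration: the class name, the "<teamId>|<userId>" string layout of the key, the '|' separator (assumed never to occur in an id), and the TreeMap standing in for a sorted state store. In a real topology the partition function would live in a custom StreamPartitioner on the repartition step, and the eviction would be a range scan plus deletes on the KeyValueStore, where the range bounds depend on the byte order of your serialized keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TeamPrefixSketch {

    // Hypothetical key layout: "<teamId>|<userId>"; '|' must not occur in ids.
    static final char SEP = '|';

    static String key(String teamId, String userId) {
        return teamId + SEP + userId;
    }

    // Partition on the team prefix only, so every <team, user> key of one
    // team maps to the same partition regardless of the user part.
    static int partitionFor(String compositeKey, int numPartitions) {
        String teamId = compositeKey.substring(0, compositeKey.indexOf(SEP));
        int hash = Arrays.hashCode(teamId.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    // Remove every entry whose key starts with "<teamId>|" using one range
    // over the sorted map; returns the number of entries removed.
    static int evictTeam(TreeMap<String, Long> store, String teamId) {
        String from = teamId + SEP;               // inclusive lower bound
        String to = teamId + (char) (SEP + 1);    // exclusive upper bound
        NavigableMap<String, Long> range = store.subMap(from, true, to, false);
        int removed = range.size();
        range.clear();                            // clears the backing map too
        return removed;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> store = new TreeMap<>();
        store.put(key("teamA", "alice"), 3L);
        store.put(key("teamA", "bob"), 5L);
        store.put(key("teamB", "carol"), 7L);

        boolean samePartition =
            partitionFor(key("teamA", "alice"), 8) == partitionFor(key("teamA", "bob"), 8);
        System.out.println("teamA keys share a partition: " + samePartition);

        int removed = evictTeam(store, "teamA");
        System.out.println("evicted " + removed + " entries, remaining: " + store.keySet());
    }
}
```

The point of the design is that the aggregation key stays per user, so each update only serializes one user's aggregate, while co-partitioning by team keeps all of a team's data on the same task for joins and for the final cleanup.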

Guozhang

On Mon, Oct 26, 2020 at 1:39 PM Jan Bols <janb...@telenet.be> wrote:

> For a kafka-streams application, we keep data per team. Data from 2 teams
> never meet but within a team, data is highly integrated. A team has team
> members but also has several types of equipment.
> A team has a lifespan of about 1-3 days after which the team is removed and
> all data relating to that team should be evicted.
>
> How would you partition the data?
> - Using the team id as key for all streams seems not ideal b/c this means
> all aggregations need to happen per team involving a ser/deser of the
> entire team data. Suppose there's 10 team members and only 1 team member is
> sending events that need to be aggregated. In this case, we need a
> ser/deser of the entire aggregated team data. I'm afraid this would result
> in quite a bit of overhead because.
> - Using the user id or equipment id as key would result in much smaller
> aggregations but does mean quite a bit of repartitioning when aggregating
> and joining users of the same team.
>
> I ended up using the second approach, but I wonder if that was really a
> good idea b/c the entire streaming logic does become quite involved.
>
> What is your experience with this type of data?
>
> Best regards
> Jan
>

--
-- Guozhang