Hi,

we want to cluster a stream of Tweets using Flink. Every incoming tweet is compared to the last 100 tweets. After this comparison, a cluster ID is assigned to the tweet. We try to find out the best approach how to solve this:

1. Using a stream window of the last tweets seems to be difficult because we would need to cross join this window with every incoming tweet. According to my research the Flink API does not support cross joins on stream windows. 2. We could also store the last 100 tweets in one operator with parallelism=1. This would work but it introduces a bottleneck. 3. We could share the last 100 tweets as a "shared state" among the operator that assigns the cluster. But every tweet changes the state so there would be a lot of synchronization effort between the operators.

Are you aware of other possible solutions? Currently solution #2 seems the most promising to me but I do not like the bottleneck.

Best regards Jan

Reply via email to