stream clustering in flink

Jan Nehring Mon, 06 Feb 2017 04:40:25 -0800

Hi,

we want to cluster a stream of Tweets using Flink. Every incoming tweet is compared to the last 100 tweets. After this comparison, a cluster ID is assigned to the tweet. We are trying to find the best approach to solve this:

1. Using a stream window of the last tweets seems to be difficult because we would need to cross join this window with every incoming tweet. According to my research the Flink API does not support cross joins on stream windows.
2. We could also store the last 100 tweets in one operator with parallelism=1. This would work but it introduces a bottleneck.
3. We could share the last 100 tweets as a "shared state" among the operators that assign the cluster. But every tweet changes the state, so there would be a lot of synchronization effort between the operators.

Are you aware of other possible solutions? Currently solution #2 seems the most promising to me but I do not like the bottleneck.
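
To make solution #2 concrete, here is a rough sketch of the per-tweet logic, written in plain Python outside of Flink. It only illustrates the idea of one buffer holding the last 100 tweets. The class name `LastNClusterer`, the Jaccard similarity on word sets, and the threshold of 0.5 are assumptions made for illustration, not our actual comparison function.

```python
from collections import deque


class LastNClusterer:
    """Hypothetical helper: keeps the last `window_size` tweets and
    assigns a cluster ID to every incoming tweet."""

    def __init__(self, window_size=100, threshold=0.5):
        # Each entry is (token_set, cluster_id); deque drops the oldest
        # entry automatically once window_size is reached.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold  # assumed similarity cut-off
        self.next_cluster_id = 0

    @staticmethod
    def jaccard(a, b):
        # Assumed similarity measure: overlap of word sets.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def assign(self, text):
        tokens = frozenset(text.lower().split())
        best_id, best_sim = None, 0.0
        # Compare the incoming tweet with every buffered tweet.
        for other_tokens, cluster_id in self.window:
            sim = self.jaccard(tokens, other_tokens)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        # No sufficiently similar tweet in the buffer: open a new cluster.
        if best_id is None or best_sim < self.threshold:
            best_id = self.next_cluster_id
            self.next_cluster_id += 1
        self.window.append((tokens, best_id))
        return best_id
```

In a Flink job this logic would live inside the single operator instance with parallelism=1, which is exactly where the bottleneck comes from: every tweet passes through the same buffer.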


Best regards Jan
