We currently have a stateful Spark Structured Streaming job that computes aggregates for each ID. I need to implement a new requirement: if the number of incoming messages for a particular ID exceeds a certain threshold, that ID should be added to a blacklist and its state removed. Going forward, no state is created for a blacklisted ID; any message carrying a blacklisted ID is simply filtered out.
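To make the requirement concrete, here is a minimal sketch (plain Python, outside Spark) of the per-ID logic we want to run inside the stateful operator. The names `THRESHOLD` and `StateHandler` are illustrative, not Spark APIs; in the real job this logic would live inside something like `flatMapGroupsWithState`, with the per-ID counts held in `GroupState`:

```python
THRESHOLD = 3  # hypothetical cutoff after which an ID is blacklisted

class StateHandler:
    """Tracks a message count per ID. Once the count exceeds THRESHOLD,
    the ID is blacklisted, its state is dropped, and all later messages
    for that ID are filtered out without creating new state."""

    def __init__(self):
        self.counts = {}        # per-ID state (GroupState in the real job)
        self.blacklist = set()  # IDs whose messages are dropped

    def process(self, msg_id):
        # Blacklisted IDs never get new state; drop the message.
        if msg_id in self.blacklist:
            return None
        count = self.counts.get(msg_id, 0) + 1
        if count > THRESHOLD:
            # Limit exceeded: blacklist the ID and remove its state.
            self.blacklist.add(msg_id)
            del self.counts[msg_id]
            return None
        self.counts[msg_id] = count
        return msg_id  # message passes through

handler = StateHandler()
out = [handler.process("a") for _ in range(5)]
# first 3 messages pass, the 4th trips the threshold, the 5th is filtered
```

The open question is where `blacklist` should live, since in Spark it must be visible to every executor, not a single in-memory set as above.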
What's the best way to implement this in Spark Structured Streaming? Essentially we need a distributed HashSet that gets updated intermittently and is made available to all executors so that they can filter out unwanted messages. Any pointers would be greatly appreciated. Is the only option to use a third-party distributed cache such as EhCache or Redis?