We currently have a stateful Spark Structured Streaming job that computes
aggregates for each ID. I need to implement a new requirement: if the
number of incoming messages for a particular ID exceeds a certain
threshold, add that ID to a blacklist and remove its state. Going
forward, we will not create state for any blacklisted ID; its messages
should simply be filtered out.
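To make the requirement concrete, here is a minimal sketch of the per-ID
update logic we are after, written as plain Python so the intent is clear
(in Spark this would sit inside something like mapGroupsWithState; the
threshold, state shape, and function names below are hypothetical):

```python
# Hypothetical threshold: IDs exceeding this many messages get blacklisted.
MAX_MESSAGES = 100

def update_state(state, new_messages):
    """Return the next state for one ID: either an aggregate or a
    'blacklisted' tombstone that replaces the removed aggregate state."""
    if state.get("blacklisted"):
        # Already blacklisted: keep only the tombstone, drop the messages.
        return state
    count = state.get("count", 0) + len(new_messages)
    if count > MAX_MESSAGES:
        # Threshold exceeded: discard the aggregate, remember only the flag.
        return {"blacklisted": True}
    # Otherwise keep aggregating (a running sum as a stand-in aggregate).
    total = state.get("total", 0) + sum(new_messages)
    return {"count": count, "total": total}
```

This is just a sketch of the desired behaviour, not a claim about how the
blacklist should be shared across executors.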

What is the best way to implement this in Spark Structured Streaming?
Essentially, we need a distributed HashSet that is updated intermittently
and made available to all executors so that they can filter out unwanted
messages.
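The filtering step itself is simple once every executor can see the set;
a sketch (plain Python, hypothetical data; in Spark this would be a
filter on the streaming DataFrame against a broadcast or externally
refreshed set):

```python
# Hypothetical blacklist snapshot visible to an executor.
blacklist = {"id-42", "id-99"}

messages = [
    {"id": "id-1", "value": 10},
    {"id": "id-42", "value": 20},  # blacklisted: should be dropped
    {"id": "id-7", "value": 30},
]

# Keep only messages whose ID is not blacklisted.
kept = [m for m in messages if m["id"] not in blacklist]
```

The open question is how to keep that set fresh on all executors, not how
to apply it.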

Any pointers would be greatly appreciated. Is the only option to use a
third-party distributed cache such as Ehcache, Redis, etc.?
