If you use the `flatMapGroupsWithState`/`mapGroupsWithState` API for a "stateful" SS job,
the blacklisting structure can be kept in the user-defined state.
Using a 3rd-party cache would also be a good choice.
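
For example, here is a rough sketch in Scala of the first approach. The
threshold, event schema, and socket source below are just placeholders for
illustration; the idea is that the per-ID state carries both the running
count and a "blacklisted" flag, and once the flag is set the heavy aggregate
is dropped and further messages for that ID are filtered out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(id: String, payload: String)
case class IdState(count: Long, blacklisted: Boolean)

object BlacklistSketch {
  val threshold = 1000L  // hypothetical per-ID message limit

  def updateState(
      id: String,
      events: Iterator[Event],
      state: GroupState[IdState]): Iterator[Event] = {
    val current = state.getOption.getOrElse(IdState(0L, blacklisted = false))
    if (current.blacklisted) {
      // ID is already blacklisted: keep only the small marker, emit nothing.
      Iterator.empty
    } else {
      val batch = events.toSeq
      val newCount = current.count + batch.size
      if (newCount > threshold) {
        // Exceeded the limit: replace the accumulated state with a
        // minimal blacklist marker and drop this batch.
        state.update(IdState(0L, blacklisted = true))
        Iterator.empty
      } else {
        state.update(IdState(newCount, blacklisted = false))
        batch.iterator
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("blacklist-sketch").getOrCreate()
    import spark.implicits._

    // Assumed input source; replace with your real stream and parsing logic.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .as[String]
      .map(line => Event(line.split(",")(0), line))

    val filtered = events
      .groupByKey(_.id)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(updateState)

    filtered.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```

Since the blacklist lives in the state store keyed by ID, every executor sees
its own partition's markers and no separate distributed HashSet is needed.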

Eric Beabes <mailinglist...@gmail.com> wrote on Wed, Nov 11, 2020 at 6:54 AM:

> Currently we’ve a “Stateful” Spark Structured Streaming job that computes
> aggregates for each ID. I need to implement a new requirement which says
> that if the no. of incoming messages for a particular ID exceeds a certain
> value then add this ID to a blacklist & remove the state for it. Going
> forward for any ID that’s blacklisted we will not create a state for it.
> The message will simply get filtered out if the ID is blacklisted.
>
> What’s the best way to implement this in Spark Structured Streaming?
> Essentially what we need to do is create a Distributed HashSet that gets
> updated intermittently & make this HashSet available to all Executors so
> that they can filter out unwanted messages.
>
> Any pointers would be greatly appreciated. Is the only option to use a
> 3rdparty Distributed Cache tool such as EhCache, Redis etc?
>
