Hey everyone,

I've been looking at Flink to handle a fairly complex use case and was
hoping for some feedback on whether the approach I'm thinking about seems
reasonable.  When researching what people build on Flink, most of the
focus seems to be on running a few heavyweight/complex jobs, whereas the
approach I'm considering involves executing many smaller, more
lightweight jobs.

The core idea is that we have a lot (think 100s or 1000s) of incoming data
streams (maybe via something like Apache Pulsar), and we have rules of
varying complexity that need to be evaluated against individual streams.
If a rule matches, an event needs to be emitted to an output stream.  The
rules could be as simple as "in any event, if you see field X set to value
'foo', it's a match" or more complex, like "if you see an event of type A
followed by an event of type B followed by an event of type C within a
certain time window, then it's a match."  These rules are long-running
(could be hours, days, weeks, or longer).
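
To make the two rule shapes concrete, here's a rough sketch of what I'm
picturing in the DataStream + CEP APIs (the Event type, its fields, the
in-memory source, and the 10-minute window are all just placeholders I
made up; the real jobs would read from Pulsar and write matches to an
output stream):

// Sketch only -- Flink 1.18 DataStream + CEP APIs.
import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.functions.PatternProcessFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class RuleSketch {

    /** Minimal placeholder event type (the real schema comes from Pulsar). */
    public static class Event {
        public String type;
        public String fieldX;
        public Event() {}
        public Event(String type, String fieldX) { this.type = type; this.fieldX = fieldX; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the real Pulsar source; one input stream per rule/job.
        DataStream<Event> events = env.fromElements(
                new Event("A", "foo"), new Event("B", "bar"), new Event("C", "baz"));

        // Simple rule: match any event whose field X is 'foo'.
        DataStream<Event> simpleMatches = events.filter(e -> "foo".equals(e.fieldX));

        // Sequence rule: A followed by B followed by C within a time window.
        Pattern<Event, ?> abc = Pattern.<Event>begin("a")
                .where(SimpleCondition.of(e -> "A".equals(e.type)))
                .followedBy("b")
                .where(SimpleCondition.of(e -> "B".equals(e.type)))
                .followedBy("c")
                .where(SimpleCondition.of(e -> "C".equals(e.type)))
                .within(Time.minutes(10));

        DataStream<String> sequenceMatches = CEP.pattern(events, abc)
                .inProcessingTime() // the real jobs would likely use event time + watermarks
                .process(new PatternProcessFunction<Event, String>() {
                    @Override
                    public void processMatch(Map<String, List<Event>> match,
                                             Context ctx, Collector<String> out) {
                        out.collect("rule matched: " + match);
                    }
                });

        // Stand-in for the output stream (e.g. a Pulsar sink).
        simpleMatches.print();
        sequenceMatches.print();

        env.execute("rule-sketch");
    }
}

The simple rules are stateless filters; the sequence rules are where CEP
state would live for the (potentially very long) lifetime of the rule.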

It *seems* to me like Application Mode (
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/overview/)
with the Kubernetes Operator (
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#application-deployments),
which creates a new cluster per application, is what I'd want.
I'm envisioning that each of these long-running rules (each potentially
reading a different data stream) runs as its own job in its own
application (maybe some can be combined later, but to start they'll all
be separate).
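
For what it's worth, one way I could imagine structuring those per-rule
applications is a single shared entrypoint that gets its topic and rule
id from job arguments (the parameter names and topic wiring below are
hypothetical), so that one image serves all 100s/1000s of deployments:

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerRuleJob {
    public static void main(String[] args) throws Exception {
        // Each FlinkDeployment would pass its own args, so one image covers every rule.
        ParameterTool params = ParameterTool.fromArgs(args);
        String inputTopic = params.getRequired("input-topic"); // hypothetical parameter
        String ruleId = params.getRequired("rule-id");         // hypothetical parameter

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Pulsar source reading `inputTopic`.
        DataStream<String> events = env.fromElements("foo", "bar");

        // Stand-in for looking up and applying the rule identified by `ruleId`;
        // real matches would go to the output stream instead of stdout.
        events.filter("foo"::equals).print();

        env.execute("rule-" + ruleId);
    }
}

That would hopefully keep the per-rule differences mostly in the
deployment spec (args, parallelism, memory) rather than in separate code
per rule.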

Does that seem like the right approach to running a large number of small
jobs concurrently on Flink?  Are there any "gotchas" I'm not thinking of?
Any alternative approaches worth considering?  Are there any users we know
of who do something like this currently?

Thanks for your time and insight!

~Brent
