Hello Dev Mailing List,
I have a feature proposal I'd like to share, and I'm interested in gathering
feedback from everyone on the mailing list.
(please view this email in monospace font when possible)
_*Proposal*_
Currently, when the Flink DataStream API is used, all data processing logic
needs to be written and compiled before job deployment, and any change to that
logic generally requires recompiling the code.
To support a more generic deployment process, we can take inspiration from
platforms such as Apache Airflow and Apache Spark, where support for
externalizing operator logic exists.
This can be achieved by extending the default Flink operator functions used for
filter, process, keyBy, flatMap, map, etc., merely adding an externalized
"Script" component that gets executed in place of pre-compiled code.
Surface-level implementation idea:
ScriptFilterFunction extends RichFilterFunction
ScriptProcessFunction extends RichProcessFunction
ScriptMapFunction extends RichMapFunction
…etc
----------------------
| RichFilterFunction |
----------------------
           ▲
           |
------------------------
| ScriptFilterFunction |
------------------------
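As a rough sketch (ScriptExecutor and all Script* names are part of this
proposal, not existing Flink API), ScriptFilterFunction could simply delegate
the filter decision to its executor:

    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.configuration.Configuration;

    // Sketch only: a filter whose predicate comes from an externalized script.
    public class ScriptFilterFunction<T> extends RichFilterFunction<T> {

        private final ScriptExecutor executor; // proposed interface, see below

        public ScriptFilterFunction(ScriptExecutor executor) {
            this.executor = executor;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            executor.setup(); // initialize the engine / compile the script once
        }

        @Override
        public boolean filter(T value) throws Exception {
            // delegate the filter decision to the externalized script
            return (Boolean) executor.execute(value);
        }
    }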
We can add support progressively; not every operator needs its script
counterpart in the initial release.
Every supported operator will have its Script counterpart, and each one will
hold at least one ScriptExecutor implementation; Co*Function variants may
require two. We may also add metrics to each extended function to track script
performance.
There will also be a default ScriptExecutor implementation that does nothing
when the user does not define a script.
ScriptExecutor is an interface that can be implemented to support various
externalized script languages.
The interface will include two abstract methods, setup() and execute() /
execute(...args), sketched just after this list:
• setup is used to perform any execution-engine initialization or script
compilation
• execute is used to invoke the script function with optional input object(s)
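A minimal sketch of the proposed interface, including the default no-op
implementation mentioned above (all names are illustrative):

    import java.io.Serializable;

    // Proposed interface; one implementation per supported script language.
    public interface ScriptExecutor extends Serializable {

        // initialize the execution engine and/or compile the script once
        void setup() throws Exception;

        // invoke the script function, optionally passing input object(s)
        Object execute(Object... args) throws Exception;
    }

    // Default executor used when the user defines no script: a pass-through.
    class NoOpScriptExecutor implements ScriptExecutor {
        @Override
        public void setup() { }

        @Override
        public Object execute(Object... args) {
            return args.length > 0 ? args[0] : null; // do nothing to the input
        }
    }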
             ------------------
             | ScriptExecutor |
             ------------------
                ▲          ▲
               /            \
              /              \
----------------------   ------------------------
| BashScriptExecutor |   | GroovyScriptExecutor |
----------------------   ------------------------
Depending on the ScriptExecutor, it will accept either file path(s), loaded via
Flink's own FileSystem API, or inline script text that defines the processing
logic.
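For illustration, loading the script text through Flink's FileSystem API could
look roughly like this (the helper itself is hypothetical):

    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    import java.io.ByteArrayOutputStream;

    // Hypothetical helper: read script text from any Flink-supported
    // filesystem (local, HDFS, S3, ...) via Flink's FileSystem abstraction.
    static String readScript(String uri) throws Exception {
        Path path = new Path(uri);
        FileSystem fs = path.getFileSystem();
        try (FSDataInputStream in = fs.open(path);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toString("UTF-8");
        }
    }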
_*Use Case*_
Suppose we have a super generic job that does the following:
[Kafka] -> [Filter] -> [Kafka]
Instead of having to recompile the DataStream-API-based binary every time the
filter condition changes, we can now specify a script file (possibly via a job
parameter or some sort of configuration file) located on an object store or
remote filesystem supported by Flink, and simply redeploy the job.
The same job setup can also be reused for multiple similar jobs with minor
differences in filtering criteria.
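With that in place, the wiring could look roughly like this (env, kafkaSource
and kafkaSink are assumed to be already defined; GroovyScriptExecutor is
sketched further below):

    import org.apache.flink.api.java.utils.ParameterTool;

    // Hypothetical wiring of the generic [Kafka] -> [Filter] -> [Kafka] job;
    // the filter logic lives in an external script, not in the compiled jar.
    ParameterTool params = ParameterTool.fromArgs(args);
    String scriptUri = params.getRequired("filter.script"); // e.g. an S3 URI

    env.addSource(kafkaSource)
       .filter(new ScriptFilterFunction<>(new GroovyScriptExecutor(scriptUri)))
       .addSink(kafkaSink);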
Given that Flink is written in Java & Scala, JVM-based languages such as Groovy
& Scala, whose scripts can be compiled into JVM byte code at runtime, can be
supported easily, and performance will generally be on par with a pre-compiled
deployable binary.
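For example, a GroovyScriptExecutor could compile the script once in setup()
and reuse the compiled class for every record (groovy.lang.GroovyShell is
Groovy's standard embedding API; the class around it is a sketch of this
proposal, reusing the hypothetical readScript helper from above):

    import groovy.lang.Binding;
    import groovy.lang.GroovyShell;
    import groovy.lang.Script;

    // Sketch: compile the Groovy script to JVM byte code once in setup(),
    // then run the compiled class for every incoming record.
    public class GroovyScriptExecutor implements ScriptExecutor {

        private final String scriptUri;
        private transient Script compiled;

        public GroovyScriptExecutor(String scriptUri) {
            this.scriptUri = scriptUri;
        }

        @Override
        public void setup() throws Exception {
            compiled = new GroovyShell().parse(readScript(scriptUri));
        }

        @Override
        public Object execute(Object... args) throws Exception {
            Binding binding = new Binding();
            binding.setVariable("value", args.length > 0 ? args[0] : null);
            compiled.setBinding(binding);
            return compiled.run(); // e.g. script body: value.contains("error")
        }
    }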
By making this part of the Flink operator set, we can open up universal script
support that rivals other platforms and frameworks.
Looking forward to hearing back from the community regarding this feature.
Thanks,
K