To put it differently, to make sure I understand: is the thinking that the workflow would execute in a linear fashion (assuming that is what the Redshift pipeline does) to avoid having to keep track of the state of the job?
On Mon, Nov 20, 2023 at 7:02 AM Bertil Chapuis <bchap...@gmail.com> wrote:
> Hello everyone,
>
> It would be great to improve the workflow engine in Baremaps (package:
> org.apache.baremaps.workflow). In the current version, a workflow is a
> directed acyclic graph (DAG) of steps. Each step can have one or more
> tasks executed sequentially or in parallel. The inputs and outputs of
> the tasks are set manually. Some of the outputs (e.g., a table created
> in a database) are not described. Furthermore, some resources (e.g.,
> DataSources) are shared across the workflow with a context object, but
> one task must be aware of what another task did to benefit from shared
> resources. This approach is loosely based on GitHub Actions.
>
> A nice improvement would be to remove the notion of step, to
> systematically describe the inputs and outputs of the tasks, and to
> introduce a format in the configuration file to describe the shared
> resources accessed via the context object. This would probably make
> the configuration file of the workflow more difficult to read, but at
> least everything would be declared in it. The DAG could be inferred
> from the inputs and outputs of the tasks. This new approach would
> probably be closer to what AWS Data Pipeline does.
>
> https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-redshift-define-pipeline-cli.html
>
> I'd love to gather both technical and non-technical feedback regarding
> this question. If you have any experience, whether good, mixed, or bad,
> with the current approach, please do not hesitate to share it.
> Additionally, if you have experience with other workflow technologies,
> it would be valuable to hear about those as well.
>
> Thanks a lot for your help,
>
> Bertil
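For reference, here is a rough sketch of what "inferring the DAG from the inputs and outputs of the tasks" could look like. All names here are hypothetical (this is not the actual org.apache.baremaps.workflow API): each task declares the resource identifiers it reads and produces, edges are derived by matching inputs against producers, and a topological sort yields a valid execution order.

import java.util.*;

// Hypothetical sketch: a task declares what it reads and what it produces.
record Task(String id, Set<String> inputs, Set<String> outputs) {}

class WorkflowGraph {

  // Infer dependencies by matching inputs to outputs, then order the tasks.
  static List<Task> executionOrder(List<Task> tasks) {
    // Map each declared output to the task that produces it.
    Map<String, Task> producers = new HashMap<>();
    for (Task task : tasks) {
      for (String output : task.outputs()) {
        producers.put(output, task);
      }
    }

    // Build the dependency graph: task -> tasks it depends on.
    Map<Task, Set<Task>> deps = new HashMap<>();
    for (Task task : tasks) {
      Set<Task> upstream = new HashSet<>();
      for (String input : task.inputs()) {
        Task producer = producers.get(input);
        if (producer != null && producer != task) {
          upstream.add(producer);
        }
      }
      deps.put(task, upstream);
    }

    // Topological sort: repeatedly schedule tasks whose dependencies
    // have all been scheduled; if no task qualifies, there is a cycle.
    List<Task> order = new ArrayList<>();
    Set<Task> done = new HashSet<>();
    while (order.size() < tasks.size()) {
      boolean progressed = false;
      for (Task task : tasks) {
        if (!done.contains(task) && done.containsAll(deps.get(task))) {
          order.add(task);
          done.add(task);
          progressed = true;
        }
      }
      if (!progressed) {
        throw new IllegalStateException("Cycle detected: workflow is not a DAG");
      }
    }
    return order;
  }

  public static void main(String[] args) {
    // Hypothetical tasks; resource identifiers stand in for the declared
    // inputs/outputs that would live in the configuration file.
    List<Task> tasks = List.of(
        new Task("import", Set.of("file://osm.pbf"), Set.of("db://osm_raw")),
        new Task("simplify", Set.of("db://osm_raw"), Set.of("db://osm_simplified")),
        new Task("export", Set.of("db://osm_simplified"), Set.of("file://tiles.mbtiles")));
    // Prints: import, simplify, export.
    WorkflowGraph.executionOrder(tasks).forEach(task -> System.out.println(task.id()));
  }
}

Note that in this scheme the engine would not be forced into strictly linear execution: within each pass, every task whose dependencies are already satisfied could be dispatched in parallel, so the answer to the state-tracking question is that the engine still tracks which outputs exist, but nothing more.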