I haven’t used Redshift yet, but I think the idea is the same. In Baremaps, a workflow is a way to organise the execution of tasks (importing data, creating simplification tables, creating materialised views, creating indexes, exporting tiles, etc.). In the DAG, interdependent steps are executed sequentially and independent steps are executed in parallel.
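To make that concrete, here is a minimal, self-contained sketch of what such a DAG of steps could look like. The Step and Workflow records below are purely illustrative and do not mirror the actual Baremaps classes:

import java.util.List;

// Illustrative only: a toy model of "a workflow is a DAG of steps",
// not the actual Baremaps classes. Each step lists the ids of the
// steps it depends on.
public class WorkflowModelSketch {

  record Step(String id, List<String> needs, Runnable task) {}

  record Workflow(List<Step> steps) {}

  public static void main(String[] args) {
    // "simplify" and "index" both depend on "import" but not on each
    // other, so they may run in parallel; "export" waits for both.
    var workflow = new Workflow(List.of(
        new Step("import",   List.of(),                    () -> System.out.println("import data")),
        new Step("simplify", List.of("import"),            () -> System.out.println("create simplification tables")),
        new Step("index",    List.of("import"),            () -> System.out.println("create indexes")),
        new Step("export",   List.of("simplify", "index"), () -> System.out.println("export tiles"))));
    System.out.println(workflow.steps().size() + " steps");
  }
}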
The current implementation of the workflow executor is quite minimalistic and fits in about 300 LoC. It implements a right-to-left resolution of the DAG: we start from the leaf nodes and end with the root. Each step is transformed into a CompletableFuture executed asynchronously in a fixed-size thread pool. The context allows resources to be shared between the tasks of the different steps of a workflow, such as a datasource with a fixed-size connection pool. A rough sketch of this pattern, building on the toy model above, is included after the quoted thread below.

https://github.com/apache/incubator-baremaps/blob/main/baremaps-core/src/main/java/org/apache/baremaps/workflow/WorkflowExecutor.java

Here is the workflow we currently use to create the OpenStreetMap basemap (here, JavaScript is used as a configuration language instead of JSON):

https://github.com/apache/incubator-baremaps/blob/main/basemap/import.js

I hope this clarifies things a bit. Initially, I hesitated for a long time over using an engine such as Apache Beam, but after spending some time on it, I eventually favoured a lighter solution.

> On 21 Nov 2023, at 04:55, Josh Fischer <j...@joshfischer.io> wrote:
>
> To say it in different words to make sure I understand. Is the thinking
> that the workflow would execute in a linear fashion (assuming that was what
> the redshift pipeline was doing) to prevent needing to keep track of the
> state of the job?
>
> On Mon, Nov 20, 2023 at 7:02 AM Bertil Chapuis <bchap...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> It would be great to improve the workflow engine in Baremaps (package:
>> org.apache.baremaps.workflow). In the current version, a workflow is a
>> directed acyclic graph (DAG) of steps. Each step can have one or more
>> tasks executed sequentially or in parallel. The inputs and outputs of the
>> tasks are set manually. Some of the outputs (e.g., a table created in a
>> database) are not described. Furthermore, some resources (e.g.,
>> DataSources) are shared across the workflow with a context object, but one
>> task must be aware of what another task did to benefit from shared
>> resources. This approach is loosely based on GitHub Actions.
>>
>> A nice improvement would be to remove the notion of step, to
>> systematically describe the inputs and outputs of the tasks, and to
>> introduce a format in the configuration file to describe the shared
>> resources accessed via the context object. This would probably make the
>> configuration file of the workflow more difficult to read, but at least
>> everything would be declared in it. The DAG could be inferred from the
>> inputs and outputs of the tasks. This new approach would probably be
>> closer to what AWS Data Pipeline does.
>>
>> https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-redshift-define-pipeline-cli.html
>>
>> I’d love to gather both technical and non-technical feedback regarding
>> this question. If you have any experiences, whether good, mixed, or bad,
>> with the current approach, please do not hesitate to share them.
>> Additionally, if you have experience with other workflow technologies, it
>> would be valuable to hear about those as well.
>>
>> Thanks a lot for your help,
>>
>> Bertil
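As mentioned above, here is a rough, self-contained sketch of the execution pattern: each step becomes a CompletableFuture scheduled on a fixed-size thread pool, a step only starts once all of its dependencies have completed, and independent steps end up running in parallel. The Step record is re-declared from the earlier snippet to keep the example runnable on its own, and the sketch takes a shortcut by assuming the steps are already listed in dependency order; the real executor resolves the DAG right to left, as described above, and is the WorkflowExecutor linked earlier.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of the execution pattern described above, not the
// actual WorkflowExecutor from org.apache.baremaps.workflow.
public class WorkflowExecutorSketch {

  record Step(String id, List<String> needs, Runnable task) {}

  public static void main(String[] args) {
    // Assumed to be listed in dependency order to keep the sketch short.
    List<Step> steps = List.of(
        new Step("import",   List.of(),                    () -> System.out.println("import data")),
        new Step("simplify", List.of("import"),            () -> System.out.println("create simplification tables")),
        new Step("index",    List.of("import"),            () -> System.out.println("create indexes")),
        new Step("export",   List.of("simplify", "index"), () -> System.out.println("export tiles")));

    ExecutorService pool = Executors.newFixedThreadPool(4);
    Map<String, CompletableFuture<Void>> futures = new HashMap<>();

    // Each step becomes a CompletableFuture scheduled on the fixed-size pool.
    // A step with dependencies only starts once all of them have completed;
    // independent steps therefore run in parallel. In the real engine a
    // context object (e.g. holding a datasource with a connection pool) is
    // also made available to the tasks; that is omitted here.
    for (Step step : steps) {
      CompletableFuture<Void> dependencies = CompletableFuture.allOf(
          step.needs().stream().map(futures::get).toArray(CompletableFuture[]::new));
      futures.put(step.id(), dependencies.thenRunAsync(step.task(), pool));
    }

    // Wait for all steps to complete before shutting down the pool.
    CompletableFuture.allOf(futures.values().toArray(CompletableFuture[]::new)).join();
    pool.shutdown();
  }
}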