I haven’t used Redshift yet, but I think the idea is the same. In Baremaps, a workflow is a way to organise the execution of tasks (importing data, creating simplification tables, creating materialised views, creating indexes, exporting tiles, etc.). In the DAG, interdependent steps are executed sequentially and independent steps are executed in parallel.
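To make that concrete, here is a minimal, self-contained sketch of what such a DAG of steps could look like. The Step and Workflow records below are purely illustrative and do not mirror the actual Baremaps classes:

import java.util.List;

// Illustrative only: a toy model of "a workflow is a DAG of steps",
// not the actual Baremaps classes. Each step lists the ids of the
// steps it depends on.
public class WorkflowModelSketch {

  record Step(String id, List<String> needs, Runnable task) {}

  record Workflow(List<Step> steps) {}

  public static void main(String[] args) {
    // "simplify" and "index" both depend on "import" but not on each
    // other, so they may run in parallel; "export" waits for both.
    var workflow = new Workflow(List.of(
        new Step("import",   List.of(),                    () -> System.out.println("import data")),
        new Step("simplify", List.of("import"),            () -> System.out.println("create simplification tables")),
        new Step("index",    List.of("import"),            () -> System.out.println("create indexes")),
        new Step("export",   List.of("simplify", "index"), () -> System.out.println("export tiles"))));
    System.out.println(workflow.steps().size() + " steps");
  }
}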
The current implementation of the workflow executor is quite minimalistic and fits in about 300 LoC. It implements a right-to-left resolution of the DAG: we start from the leaf nodes and end with the root. Each step is transformed into a CompletableFuture executed asynchronously in a fixed-size thread pool. The context allows resources to be shared between the tasks of the different steps of a workflow, such as a datasource with a fixed-size connection pool. A rough sketch of this pattern, building on the toy model above, is included after the quoted thread below.

https://github.com/apache/incubator-baremaps/blob/main/baremaps-core/src/main/java/org/apache/baremaps/workflow/WorkflowExecutor.java

Here is the workflow we currently use to create the OpenStreetMap basemap (here, JavaScript is used as a configuration language instead of JSON):

https://github.com/apache/incubator-baremaps/blob/main/basemap/import.js

I hope this clarifies things a bit. Initially, I hesitated for a long time over using an engine such as Apache Beam, but after spending some time on it, I eventually favoured a lighter solution.

> On 21 Nov 2023, at 04:55, Josh Fischer <j...@joshfischer.io> wrote:
>
> To say it in different words to make sure I understand. Is the thinking
> that the workflow would execute in a linear fashion (assuming that was what
> the redshift pipeline was doing) to prevent needing to keep track of the
> state of the job?
>
> On Mon, Nov 20, 2023 at 7:02 AM Bertil Chapuis <bchap...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> It would be great to improve the workflow engine in Baremaps (package:
>> org.apache.baremaps.workflow). In the current version, a workflow is a
>> directed acyclic graph (DAG) of steps. Each step can have one or more
>> tasks executed sequentially or in parallel. The inputs and outputs of the
>> tasks are set manually. Some of the outputs (e.g., a table created in a
>> database) are not described. Furthermore, some resources (e.g.,
>> DataSources) are shared across the workflow with a context object, but one
>> task must be aware of what another task did to benefit from shared
>> resources. This approach is loosely based on GitHub Actions.
>>
>> A nice improvement would be to remove the notion of step, to
>> systematically describe the inputs and outputs of the tasks, and to
>> introduce a format in the configuration file to describe the shared
>> resources accessed via the context object. This would probably make the
>> configuration file of the workflow more difficult to read, but at least
>> everything would be declared in it. The DAG could be inferred from the
>> inputs and outputs of the tasks. This new approach would probably be
>> closer to what AWS Data Pipeline does.
>>
>> https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-redshift-define-pipeline-cli.html
>>
>> I’d love to gather both technical and non-technical feedback regarding
>> this question. If you have any experiences, whether good, mixed, or bad,
>> with the current approach, please do not hesitate to share them.
>> Additionally, if you have experience with other workflow technologies, it
>> would be valuable to hear about those as well.
>>
>> Thanks a lot for your help,
>>
>> Bertil
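As mentioned above, here is a rough, self-contained sketch of the execution pattern: each step becomes a CompletableFuture scheduled on a fixed-size thread pool, a step only starts once all of its dependencies have completed, and independent steps end up running in parallel. The Step record is re-declared from the earlier snippet to keep the example runnable on its own, and the sketch takes a shortcut by assuming the steps are already listed in dependency order; the real executor resolves the DAG right to left, as described above, and is the WorkflowExecutor linked earlier.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of the execution pattern described above, not the
// actual WorkflowExecutor from org.apache.baremaps.workflow.
public class WorkflowExecutorSketch {

  record Step(String id, List<String> needs, Runnable task) {}

  public static void main(String[] args) {
    // Assumed to be listed in dependency order to keep the sketch short.
    List<Step> steps = List.of(
        new Step("import",   List.of(),                    () -> System.out.println("import data")),
        new Step("simplify", List.of("import"),            () -> System.out.println("create simplification tables")),
        new Step("index",    List.of("import"),            () -> System.out.println("create indexes")),
        new Step("export",   List.of("simplify", "index"), () -> System.out.println("export tiles")));

    ExecutorService pool = Executors.newFixedThreadPool(4);
    Map<String, CompletableFuture<Void>> futures = new HashMap<>();

    // Each step becomes a CompletableFuture scheduled on the fixed-size pool.
    // A step with dependencies only starts once all of them have completed;
    // independent steps therefore run in parallel. In the real engine a
    // context object (e.g. holding a datasource with a connection pool) is
    // also made available to the tasks; that is omitted here.
    for (Step step : steps) {
      CompletableFuture<Void> dependencies = CompletableFuture.allOf(
          step.needs().stream().map(futures::get).toArray(CompletableFuture[]::new));
      futures.put(step.id(), dependencies.thenRunAsync(step.task(), pool));
    }

    // Wait for all steps to complete before shutting down the pool.
    CompletableFuture.allOf(futures.values().toArray(CompletableFuture[]::new)).join();
    pool.shutdown();
  }
}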