Good morning,

We are ingesting a very large dataset into our database using Beam on
Spark. The dataset is available through a REST-like API and is split in
such a way that obtaining the whole dataset requires around 24000 API
calls.

All in all, this results in 24000 CSV files that need to be parsed and
then written to our database.

Unfortunately, we are encountering OutOfMemoryErrors along the way.
From what we have gathered, this is due to data being queued up between
transforms in the pipeline. To mitigate this, we tried to implement a
streaming scheme in which the requests are streamed to the request
executor and the resulting rows then flow on to the database. This too
produced the OOM error.
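For concreteness, here is a simplified sketch of the shape of pipeline we
have in mind. The DoFn bodies, the URL scheme, and the Reshuffle step are
illustrative placeholders rather than our exact code:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    public class IngestSketch {

      // Placeholder: builds the ~24000 request URLs (paging details elided).
      static List<String> buildRequestUrls() {
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < 24000; page++) {
          urls.add("https://api.example.com/data?page=" + page);
        }
        return urls;
      }

      // Placeholder DoFn: performs one HTTP call and emits the CSV lines
      // of the response one element at a time.
      static class FetchCsvFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String url, OutputReceiver<String> out) {
          // ... issue the request, stream the body, emit one element per CSV line ...
        }
      }

      // Placeholder DoFn: parses one CSV line and writes it to the database,
      // buffering rows and flushing batches in @FinishBundle.
      static class WriteRowFn extends DoFn<String, Void> {
        @ProcessElement
        public void processElement(@Element String line) {
          // ... parse, buffer, batch-write ...
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply("CreateRequests", Create.of(buildRequestUrls()))
         .apply("BreakFusion", Reshuffle.viaRandomKey()) // spread requests across workers
         .apply("FetchCsv", ParDo.of(new FetchCsvFn()))
         .apply("WriteRows", ParDo.of(new WriteRowFn()));
        p.run().waitUntilFinish();
      }
    }

Is this the right general shape, or should the fetch/parse/write steps be
structured differently to keep less data buffered between transforms?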

What are the best ways of implementing such pipelines so as to minimize the
memory footprint? Are there any differences between runners we should be
aware of here? (e.g. between Dataflow and Spark)
