Good morning,

We are ingesting a very large dataset into our database using Beam on Spark. The dataset is available through a REST-like API and is split up in such a way that, in order to obtain the whole dataset, we must make around 24000 API calls.
All in all, this results in around 24000 CSV files that need to be parsed and then written to our database. Unfortunately, we are encountering OutOfMemoryErrors along the way. From what we have gathered, this is due to data being queued up between transforms in the pipeline.

To mitigate this, we tried to implement a streaming scheme in which the requests are streamed to a request executor and the results flow on to the database, but this too produced OutOfMemoryErrors.

What are the best ways of implementing such a pipeline so as to minimize its memory footprint? And are there any differences between runners we should be aware of here (e.g. between Dataflow and Spark)?
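For concreteness, here is a minimal sketch of the kind of pipeline we mean (not our actual code; buildRequestUrls, my_table and the JDBC settings below are placeholders): a Create of the ~24000 request URLs, a Reshuffle to spread the fetches across workers, a DoFn that streams each CSV response row by row instead of buffering it, and a JdbcIO write.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class IngestPipeline {

  // Fetches one CSV chunk and emits it row by row, streaming the HTTP
  // response instead of loading the whole file into memory.
  static class FetchAndParseFn extends DoFn<String, String> {
    @ProcessElement
    public void process(@Element String url, OutputReceiver<String> out) throws Exception {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(line); // one CSV row at a time
        }
      } finally {
        conn.disconnect();
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder: the ~24000 request URLs, built up front (just strings, cheap to hold).
    List<String> requestUrls = buildRequestUrls();

    p.apply("RequestUrls", Create.of(requestUrls))
        // Break fusion so the fetches are distributed across workers.
        .apply("Redistribute", Reshuffle.viaRandomKey())
        .apply("FetchAndParse", ParDo.of(new FetchAndParseFn()))
        .apply("WriteToDb",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://db-host/mydb"))
                .withStatement("INSERT INTO my_table (raw_line) VALUES (?)")
                .withPreparedStatementSetter(
                    (element, statement) -> statement.setString(1, element)));

    p.run().waitUntilFinish();
  }

  private static List<String> buildRequestUrls() {
    // Placeholder for generating the ~24000 API call URLs.
    return java.util.stream.IntStream.range(0, 24000)
        .mapToObj(i -> "https://api.example.com/data?chunk=" + i)
        .collect(java.util.stream.Collectors.toList());
  }
}

The intent is that Reshuffle prevents all 24000 fetches from being fused onto the step that creates the URLs, and that FetchAndParseFn emits one row at a time rather than materializing an entire CSV file per element. Is this the right general shape, or is there a better pattern for keeping the queued data between transforms small?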