Hi all,

I have a pipeline which reads data from a database (PostgreSQL), enriches
the data through a side input, and finally publishes the results to Kafka.
Currently I am not using the built-in JdbcIO to read the data, but I think
there won't be any difference in using it. In my implementation I set the
fetch size and pass the data on to the next transform for processing. I
have two questions here:

1. For batch processing pipelines, is there a way to process elements in
chunks rather than reading the entire dataset and loading it into memory?
What I have observed is that the read occupies a significant amount of
memory and may even cause OOM exceptions. I am looking for some sort of
backpressure mechanism, or any other way to stop reading all the data into
memory until some of the records have been processed. I found the answer
at [1], which states that this is not possible; since that answer was given
some time ago, I wanted to check whether it is still the case. (The first
sketch below shows roughly the fetch-size configuration I mean.)

2. When dealing with side inputs, again, does Beam load the entire side
input into memory and then use the appropriate window to carry out the
operation inside a transform? (The second sketch below shows the kind of
enrichment step I have in mind.)
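For reference, here is roughly what the equivalent read would look like
with the built-in JdbcIO and a fetch size. This is only a minimal sketch;
the connection details, query, and row type are placeholders rather than my
actual code:

import java.io.Serializable;
import java.sql.ResultSet;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

public class JdbcReadSketch {

  // Placeholder row type standing in for my real mapping.
  static class SourceRow implements Serializable {
    final long id;
    final String payload;
    SourceRow(long id, String payload) { this.id = id; this.payload = payload; }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    PCollection<SourceRow> rows = pipeline.apply("ReadFromPostgres",
        JdbcIO.<SourceRow>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver",
                        "jdbc:postgresql://db-host:5432/source_db") // placeholder URL
                    .withUsername("user")
                    .withPassword("secret"))
            .withQuery("SELECT id, payload FROM source_table")      // placeholder query
            .withFetchSize(5000) // rows pulled per round trip from the DB cursor
            .withRowMapper((ResultSet rs) ->
                new SourceRow(rs.getLong("id"), rs.getString("payload")))
            .withCoder(SerializableCoder.of(SourceRow.class)));

    // ... enrichment and the Kafka write go here ...

    pipeline.run().waitUntilFinish();
  }
}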
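And for the second question, the enrichment step I have in mind looks
roughly like the following, continuing inside the pipeline above and
assuming the reference data has already been read into a
PCollection<KV<String, String>> called lookupEntries (again just a sketch
with placeholder names):

// additional imports for this part:
// import java.util.Map;
// import org.apache.beam.sdk.transforms.DoFn;
// import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
// import org.apache.beam.sdk.transforms.ParDo;
// import org.apache.beam.sdk.transforms.View;
// import org.apache.beam.sdk.values.KV;
// import org.apache.beam.sdk.values.PCollectionView;

// Turn the reference data into a map-shaped side input.
PCollectionView<Map<String, String>> lookupView =
    lookupEntries.apply("AsMapSideInput", View.asMap());

// Enrich each row using the side input; the output would feed the Kafka write.
PCollection<String> enriched = rows.apply("Enrich",
    ParDo.of(new DoFn<SourceRow, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The full map view is available to every element in this DoFn.
        Map<String, String> lookup = c.sideInput(lookupView);
        SourceRow row = c.element();
        c.output(row.payload + "," + lookup.get(String.valueOf(row.id)));
      }
    }).withSideInputs(lookupView));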

Please let me know if you have any solutions for this.

[1]
https://stackoverflow.com/questions/57580362/how-to-manage-backpressure-with-apache-beam

Thank you.
