Hi Beamers,

We are facing OutOfMemory errors in a streaming pipeline on Dataflow. We cannot get rid of them, even with bigger worker instances. Any advice would be appreciated.
The scenario is the following:

- Files are written to a bucket in GCS. A notification is set up on the bucket, so each file produces a Pub/Sub message.
- The notifications are consumed by our pipeline.
- The pipeline reads the files referenced by the notifications.
- Each file contains many events (so there is quite a big fanout).
- The events are written to BigQuery via streaming inserts.

Everything works fine when there are only a few notifications, but when files arrive at a high rate, or when there is a large backlog in the notification subscription, elements pile up in the BigQuery write and the workers eventually throw OOM. Using a larger worker does not help: with a large enough backlog, the larger instances throw OOM as well, just later...

As far as we can see, there is a checkpointing/reshuffle step inside the BigQuery write, and that's where the elements get stuck. It looks like the Pub/Sub source is consuming too many elements, and due to the fanout this causes OOM when grouping in the reshuffle.

- Is there any way to backpressure the Pub/Sub read?
- Is it possible to use a smaller window in the Reshuffle?
- How does the Reshuffle actually work?

Any advice?

Thanks in advance,
Frantisek
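P.S. To make the backpressure question concrete, here is a toy simulation (plain Python, not Beam or Dataflow code; the fanout, batch size, and buffer limit are made-up numbers) of what we think is happening: with an unthrottled read, the number of buffered elements grows with the backlog, while a bounded in-flight buffer would force the source to wait and keep memory flat.

```python
def simulate(notifications, fanout, batch, buffer_limit=None):
    """Return the peak number of buffered (in-flight) events.

    buffer_limit=None models an unthrottled source, which is what we
    suspect the Pub/Sub read is doing; a number models the kind of
    backpressure we are hoping exists. Requires fanout <= buffer_limit
    when a limit is set, or the read would never proceed.
    """
    buffered = peak = read = written = 0
    total = notifications * fanout
    while written < total:
        # Read side: consume the next notification, which fans out into
        # `fanout` events, unless backpressure says the buffer is full.
        if read < notifications and (
            buffer_limit is None or buffered + fanout <= buffer_limit
        ):
            buffered += fanout
            read += 1
        # Write side: drain one batch per "tick" (stands in for the
        # slower BigQuery streaming write).
        drained = min(batch, buffered)
        buffered -= drained
        written += drained
        peak = max(peak, buffered)
    return peak

# With no limit, the peak grows with the backlog size; with a limit,
# the peak stays below the limit no matter how large the backlog is.
peak_unbounded = simulate(1000, 100, 50)                  # → 50000
peak_bounded = simulate(1000, 100, 50, buffer_limit=500)  # → 450
```

With 1000 backlogged notifications the unthrottled run peaks at 50,000 buffered events, while the bounded run never exceeds the 500-element limit. This is of course a caricature of what the runner actually does, but it is the behaviour we would like to get from the Pub/Sub source.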
