Hey all, I’m running a Dataflow job that uses the JDBC IO transform to pull in a large amount of data (~20 million rows, for reference) from Redshift, and I’m hitting an OutOfMemoryError on the Dataflow workers once I reach around 4 million rows.
From reading the JDBC IO code and the guide here (https://beam.apache.org/documentation/io/authoring-overview/#read-transforms), it looks like it just pulls rows from the ResultSet one by one and then emits each output. Since that’s essentially a limitation of the driver, it makes sense, but is there a way I can work around the memory limitation somehow? It also seems like Dataflow repeatedly tries to create more workers to handle the work but can’t, which is part of the problem. If more info would help you figure out what I could do to avoid hitting the memory limits, I’m happy to provide it. Thanks, Chet
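For context, my read looks roughly like the following — a simplified sketch, not my exact code (the driver class, connection string, query, and row type here are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class RedshiftRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder connection details and query.
    PCollection<String> rows = p.apply(
        JdbcIO.<String>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                    "com.amazon.redshift.jdbc.Driver",
                    "jdbc:redshift://host:5439/db"))
            .withQuery("SELECT some_column FROM some_table")
            // Rows come back through the JDBC ResultSet one at a time.
            .withRowMapper(rs -> rs.getString(1))
            .withCoder(StringUtf8Coder.of()));

    p.run();
  }
}
```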
