Hey all, 

I’m running a Dataflow job that uses the JdbcIO read transform to pull in a bunch
of data (20 million rows, for reference) from Redshift, and I’m noticing that I’m
getting an OutOfMemoryError on the Dataflow workers once I reach around 4 million
rows.

From the code inside JdbcIO and the guide here
(https://beam.apache.org/documentation/io/authoring-overview/#read-transforms),
it looks like it’s just pulling rows from the result set one by one and then
emitting each output. Given that this is largely a limitation of the driver,
that makes sense, but is there a way I can get around the memory limitation
somehow? It also seems like Dataflow repeatedly tries to create more workers
to handle the work but can’t, which is part of the problem.
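One thing I’ve been wondering about: I’m assuming the Redshift driver behaves like
the Postgres JDBC driver here, which buffers the entire result set in memory by
default and only streams rows in batches when autocommit is off and a fetch size is
set on the statement. Here’s a rough sketch of what I mean (the class and method
names are just mine for illustration, not anything from Beam or the driver):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingRead {
    // Configure a JDBC statement so a Postgres-style driver fetches rows in
    // batches (cursor-based) instead of buffering the whole result set in
    // worker memory. The Postgres driver only honors the fetch size when
    // autocommit is off and the result set is forward-only.
    public static PreparedStatement prepareStreaming(
            Connection conn, String query, int fetchSize) throws SQLException {
        conn.setAutoCommit(false);  // required for cursor-based fetching
        PreparedStatement stmt = conn.prepareStatement(
                query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(fetchSize);  // e.g. 10000 rows per round trip
        return stmt;
    }
}
```

If that’s the right knob, is there a supported way to get JdbcIO to set the fetch
size (or a statement preparator hook where I could do it myself)?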

If more info is needed in order to help me sort out what I could do to not run 
into the memory limitations I’m happy to provide it. 


Thanks,

Chet 
