Hi Mark,

You're describing a common problem when handling large data sets. The way to handle it is to open the source (a ResultSet) and the destination (the Parquet writer) at the same time, and read and write the records one by one.
The easiest way to do this is with Spring's JdbcTemplate and a (stateful) RowCallbackHandler that you write yourself. In it, use a RowMapper instance to build the POJOs from the ResultSet and write them to the Parquet writer immediately. If needed, you can close the writer and create a new one for a new file every 1M records or so. A rough sketch of the idea follows below the quoted message.

Kind regards,
Oscar

--
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Wed, Jul 31, 2024, 23:39 Mark Lybarger <mlybar...@gmail.com> wrote:

> Here's my current flow: I have a Java program that uses an Avro schema file
> to generate POJOs. The code reads data from a Postgres table and transfers
> the data from the DB to a list of the generated POJOs. The process is
> currently reading 4.5M records from the DB. Once the Avro POJOs are
> populated, it uses the Avro writer to output Parquet format that is
> ingested into our data lake.
>
> The problem is that as the table keeps growing, we get OOMs. I'll be
> looking at where in the code the OOM comes from, but continually
> increasing the memory isn't a feasible solution. What are some common
> patterns for handling this? I'm thinking of chunking the records; is it
> possible to process 500k records at a time and then concatenate the
> Parquet files? I'm pretty new to this.
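Here is the rough sketch I mentioned above. It is only an illustration of the pattern, assuming Spring JDBC and parquet-avro are on the classpath; the table name, output file names, chunk size, and the generic record type are placeholders, not your actual code. The RowMapper is assumed to build one of your Avro-generated POJOs from the current ResultSet row.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.avro.Schema;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.core.RowCallbackHandler;
    import org.springframework.jdbc.core.RowMapper;

    public class ParquetExporter<T> {

        /** Stateful callback: maps each row to a POJO and writes it to Parquet immediately. */
        static class ParquetRowCallbackHandler<T> implements RowCallbackHandler, AutoCloseable {
            private static final long ROWS_PER_FILE = 1_000_000L; // roll over to a new file every ~1M rows

            private final Schema schema;
            private final RowMapper<T> rowMapper;
            private ParquetWriter<T> writer;
            private long rowsInFile;
            private int fileIndex;

            ParquetRowCallbackHandler(Schema schema, RowMapper<T> rowMapper) {
                this.schema = schema;
                this.rowMapper = rowMapper;
            }

            @Override
            public void processRow(ResultSet rs) throws SQLException {
                try {
                    if (writer == null || rowsInFile >= ROWS_PER_FILE) {
                        close(); // finish the previous chunk, if any
                        writer = AvroParquetWriter.<T>builder(
                                        new Path("export-part-" + fileIndex++ + ".parquet"))
                                .withSchema(schema)
                                .build();
                        rowsInFile = 0;
                    }
                    // Map the current row and write it out; nothing is kept in memory.
                    writer.write(rowMapper.mapRow(rs, (int) rowsInFile));
                    rowsInFile++;
                } catch (IOException e) {
                    throw new UncheckedIOException(e); // RowCallbackHandler cannot throw IOException
                }
            }

            @Override
            public void close() throws IOException {
                if (writer != null) {
                    writer.close();
                    writer = null;
                }
            }
        }

        public void export(JdbcTemplate jdbcTemplate, Schema schema, RowMapper<T> rowMapper) throws IOException {
            // Stream the ResultSet instead of buffering it. Note that the PostgreSQL driver
            // only honours the fetch size when autocommit is off, so run this inside a transaction.
            jdbcTemplate.setFetchSize(1_000);
            try (ParquetRowCallbackHandler<T> handler = new ParquetRowCallbackHandler<>(schema, rowMapper)) {
                jdbcTemplate.query("SELECT * FROM my_table", handler);
            }
        }
    }

Because each chunk is a complete, self-contained Parquet file, there is no need to concatenate anything afterwards: the data lake can ingest the part files side by side.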