Hi Mark,

You're describing a common problem in handling large data sets. The way to
handle it is to open the source (the ResultSet) and the destination (the
Parquet writer) at the same time, and read and write the records one by
one, so the full data set never has to fit in memory.

The easiest way to do this is with Spring's JdbcTemplate and a (stateful)
RowCallbackHandler that you write yourself. In it, use a RowMapper to turn
each ResultSet row into a POJO and write it to the Parquet writer
immediately, so only one record is held in memory at a time.
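
A minimal sketch of that pattern, assuming a generated Avro class MyRecord,
a table my_table and a RowMapper you already have (the names, the SQL and
the 1000-row fetch size are just placeholders to adjust to your schema):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.core.RowMapper;

public class ParquetExporter {

    private final JdbcTemplate jdbcTemplate;
    private final RowMapper<MyRecord> rowMapper;  // one ResultSet row -> one generated POJO

    public ParquetExporter(JdbcTemplate jdbcTemplate, RowMapper<MyRecord> rowMapper) {
        this.jdbcTemplate = jdbcTemplate;
        this.rowMapper = rowMapper;
    }

    public void export(Path outputFile) throws IOException {
        // Stream rows instead of buffering the whole ResultSet. Note that the
        // PostgreSQL driver only streams with a fetch size when autoCommit is
        // off (e.g. inside a transaction); otherwise it loads all rows at once.
        jdbcTemplate.setFetchSize(1000);

        Schema schema = MyRecord.getClassSchema();  // generated Avro classes provide this

        try (ParquetWriter<MyRecord> writer = AvroParquetWriter.<MyRecord>builder(outputFile)
                .withSchema(schema)
                .build()) {
            // The handler is called once per row: map it, write it, forget it.
            RowCallbackHandler handler = rs -> {
                MyRecord record = rowMapper.mapRow(rs, 0);
                try {
                    writer.write(record);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            };
            jdbcTemplate.query("SELECT * FROM my_table", handler);
        }
    }
}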

If needed, you can close the writer and create a new one for a new file
every 1M records or so.
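
That rollover could look something like the stateful callback below (again
only a sketch; MyRecord and the "export-part-N.parquet" naming are made up).
You'd pass an instance to jdbcTemplate.query(...) instead of the handler
above, and close it once the query returns:

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.avro.Schema;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.core.RowMapper;

public class RollingParquetCallback implements RowCallbackHandler, AutoCloseable {

    private static final long MAX_ROWS_PER_FILE = 1_000_000L;

    private final Schema schema;
    private final RowMapper<MyRecord> rowMapper;
    private ParquetWriter<MyRecord> writer;
    private long rowsInCurrentFile;
    private int fileIndex;

    public RollingParquetCallback(Schema schema, RowMapper<MyRecord> rowMapper) {
        this.schema = schema;
        this.rowMapper = rowMapper;
    }

    @Override
    public void processRow(ResultSet rs) throws SQLException {
        try {
            if (writer == null || rowsInCurrentFile >= MAX_ROWS_PER_FILE) {
                close();  // finish the previous file, if any
                writer = AvroParquetWriter.<MyRecord>builder(
                                new Path("export-part-" + fileIndex++ + ".parquet"))
                        .withSchema(schema)
                        .build();
                rowsInCurrentFile = 0;
            }
            writer.write(rowMapper.mapRow(rs, 0));
            rowsInCurrentFile++;
        } catch (IOException e) {
            throw new SQLException("Failed to write Parquet record", e);
        }
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }
}

There's no need to concatenate the resulting Parquet files afterwards: each
part is a complete, valid file on its own, and data lake tooling generally
ingests a directory of such part files as a single data set.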

Kind regards,
Oscar
-- 
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Wed, Jul 31, 2024 at 23:39 Mark Lybarger <mlybar...@gmail.com> wrote:

> Here's my current flow: I have a Java program that uses an Avro schema
> file to generate POJOs. The code reads data from a Postgres table and
> transfers the data from the db into a list of the generated POJOs. The
> process is currently reading 4.5M records from the db. Once the Avro
> POJOs are populated, it uses the Avro writer to output Parquet format
> that is ingested into our data lake.
>
> The problem is that as the table keeps growing, we get OOMs. I'll be
> looking at where in the code the OOM is coming from. Continually
> increasing the memory isn't a feasible solution. What are some common
> patterns for handling this? I'm thinking to chunk the records; is it
> possible to process 500k records at a time, then concatenate the Parquet
> files? I'm pretty new to this.
>
