Hi all,

We are trying to ingest a large amount of data (20TB) from S3 using the Flink
filesystem connector to bootstrap a Hudi table. The data is well partitioned
in S3 by date/time, but we have been hitting OOM issues in the Flink jobs, so we
want to update the job to ingest the data chunk by chunk (partition
by partition) with some kind of loop instead of all at once. We are curious
what the recommended way to do this in Flink is. I believe this should be a
common use case, so I hope to get some ideas here.

We have been using the Table API, but we are open to other APIs.
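
For reference, here is a rough sketch of the kind of loop we have in mind with
the Table API. Table names, paths, the schema, the partition list, and the
trimmed-down Hudi options are all placeholders, not our real setup:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

import java.util.List;

public class PartitionedBootstrap {

    public static void main(String[] args) throws Exception {
        // Batch mode, since this is a one-off bootstrap.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Source: filesystem connector over the partitioned S3 layout
        // (path, schema, format, and partition column are placeholders).
        tEnv.executeSql(
                "CREATE TABLE s3_source (" +
                "  id STRING," +
                "  payload STRING," +
                "  dt STRING" +
                ") PARTITIONED BY (dt) WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 's3://bucket/prefix/'," +
                "  'format' = 'parquet'" +
                ")");

        // Sink: the Hudi table being bootstrapped (real Hudi options omitted).
        tEnv.executeSql(
                "CREATE TABLE hudi_sink (" +
                "  id STRING," +
                "  payload STRING," +
                "  dt STRING" +
                ") PARTITIONED BY (dt) WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 's3://bucket/hudi/table/'" +
                ")");

        // Drive the ingestion one partition at a time, waiting for each
        // batch job to finish before starting the next one.
        List<String> partitions = List.of("2024-01-01", "2024-01-02"); // placeholder
        for (String dt : partitions) {
            tEnv.executeSql(
                    "INSERT INTO hudi_sink SELECT id, payload, dt " +
                    "FROM s3_source WHERE dt = '" + dt + "'")
                .await(); // block until this partition's job completes
        }
    }
}

The idea is that the WHERE clause on the partition column lets the filesystem
source prune to a single partition per job, so each run only reads one chunk.
Is this a reasonable approach, or is there a better built-in way?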

Thanks & Regards
Eric
