Question about Flink DataStream scalability in BATCH execution mode. Here is the problem I am trying to solve.
The input is an S3 bucket directory with about 500 GB of data across many files, while the EMR instance I am running on has only 50 GB of EBS (disk) storage. The data is time series; imagine (name, value, timestamp). I must average time_series.value by time_series.name over a tumbling window of 15 minutes, so I key by tag name and 15-minute interval. Upon aggregation, time_series.timestamp gets rounded up to the quarter hour. After aggregation, I must forward-fill the missing quarters for each time_series.name; currently, this forward-fill operator is keyed only by time_series.name. The results are saved to an RDBMS.

My questions:

1. Does keying the forward-fill operator only by name mean that in batch mode, all of the time series with the same time_series.name within the 500 GB of files must fit in memory?

2. If this job somehow reads all 500 GB before it sends anything to the first operator, where is the data stored? Given that the EMR node has only 50 GB of EBS storage, is there a way to configure Flink to store its intermediate results in S3?

3. When the job failed, I saw this exception in the log: "Recovery is suppressed by NoRestartBackoffTimeStrategy." Is there a way to configure the job to recover instead?

4. My job keeps failing for the same reason; it says, "The heartbeat of TaskManager with id container_xxx timed out." Is there a way to configure it not to time out?

I would appreciate any advice on how I should solve these problems. Thank you.
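For reference, here is a minimal plain-Java sketch (no Flink dependencies; all class and method names are hypothetical, not my actual job code) of the per-key rounding and forward-fill semantics I described above. I round each timestamp up to the end of its 15-minute window and carry the last averaged value forward over missing quarters:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ForwardFillSketch {
    static final long QUARTER_MS = 15 * 60 * 1000L;

    // Round a timestamp up to the end of its 15-minute window
    // (window-end semantics: a timestamp exactly on a boundary belongs
    // to the window ending one quarter later).
    static long roundUpToQuarter(long tsMillis) {
        return ((tsMillis / QUARTER_MS) + 1) * QUARTER_MS;
    }

    // Given averaged values keyed by quarter-end timestamp for ONE
    // time_series.name, emit a (timestamp, value) pair for every quarter
    // in the observed range, carrying the last seen value over gaps.
    static List<double[]> forwardFill(TreeMap<Long, Double> averagedByQuarter) {
        List<double[]> out = new ArrayList<>();
        if (averagedByQuarter.isEmpty()) return out;
        double last = averagedByQuarter.firstEntry().getValue();
        for (long q = averagedByQuarter.firstKey();
                q <= averagedByQuarter.lastKey(); q += QUARTER_MS) {
            Double v = averagedByQuarter.get(q);
            if (v != null) last = v;
            out.add(new double[]{q, last});
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, Double> agg = new TreeMap<>();
        agg.put(roundUpToQuarter(100_000L), 1.0);   // lands in quarter ending 900000
        agg.put(roundUpToQuarter(2_000_000L), 3.0); // lands in quarter ending 2700000
        // The quarter ending at 1800000 is missing and gets forward-filled.
        for (double[] row : forwardFill(agg)) {
            System.out.println((long) row[0] + " " + row[1]);
        }
    }
}
```

My worry is exactly about this `forwardFill` step: since it needs the whole per-name range, I do not know whether Flink's batch runtime materializes it all in memory.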
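For questions 3 and 4, I suspect the relevant knobs are the restart strategy and heartbeat options below, but the values are guesses on my part, and I am unsure whether raising the heartbeat timeout would fix anything or just mask the underlying memory pressure:

```yaml
# flink-conf.yaml -- values are illustrative guesses, not tested recommendations
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 30 s

# Heartbeats are in milliseconds; the default timeout is 50000 ms.
heartbeat.interval: 10000
heartbeat.timeout: 180000
```

Is this the right direction, or should I be attacking the TaskManager memory configuration instead?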