Hi,

One of the main reasons we moved to version 1.7 (and 1.7.2 in particular)
was the possibility of using a StreamingFileSink with S3.

We've configured a StreamingFileSink to use a DateTimeBucketAssigner to
bucket by day. It's got a parallelism of 1 and is writing to S3 from an EMR
cluster in AWS.
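
In case it helps, the sink is set up roughly like the sketch below. This is
a minimal, simplified version; the bucket path, source, checkpoint interval
and class name are placeholders rather than our actual values:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class S3SinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is enabled so that in-progress part files get completed;
        // the interval here is a placeholder.
        env.enableCheckpointing(60_000);

        // Placeholder source, just for this sketch.
        DataStream<String> stream = env.fromElements("event-1", "event-2");

        // Row-format StreamingFileSink writing to S3, bucketed by day.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3://our-bucket/output"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .withBucketAssigner(new DateTimeBucketAssigner<String>("yyyy-MM-dd"))
                .build();

        stream.addSink(sink).setParallelism(1);
        env.execute("streaming-file-sink-to-s3");
    }
}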

We ran the job and after a few hours of activity, manually cancelled it
through the jobmanager API. After confirming that a number of "part-0-x"
files existed in S3 at the expected path, we then started the job again
using the same invocation of the CLI "flink run..." command that was
originally used to start it.

It started writing data to S3 again, beginning afresh from "part-0-0", and
gradually overwrote the existing data.

I can understand that, without restoring from a checkpoint, the sink has no
indication of where to resume, but the fact that it overwrites the existing
files (as it starts writing to "part-0-0" again) is surprising. Wouldn't one
expect it to find the last existing part file and continue from the next
free number?

We're definitely using the flink-s3-fs-hadoop-1.7.2.jar and don't have the
Presto version on the classpath.

Is this the expected behaviour? We have not seen this with the non-streaming
versions of the sink.

Best regards,

Bruno
