Hi,

The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time to avoid having to remember the names of all files that were ever consumed, which would be expensive to check and store over time.
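For illustration, here is a minimal sketch of a hybrid tracker of the kind suggested below: a modification-time watermark plus a bounded set of file names whose mod time falls within an offset of that watermark. All names here are hypothetical, not Flink's actual implementation.

```python
class HybridFileTracker:
    """Sketch only: watermark on max modification time, plus a small set of
    recently seen file names to catch late files within `offset` seconds."""

    def __init__(self, offset):
        self.offset = offset              # lateness allowance in seconds
        self.max_mod_time = float("-inf") # watermark: largest mod time seen
        self.seen = {}                    # name -> mod_time, within the window

    def should_read(self, name, mod_time):
        # Below (watermark - offset): assume the file was already consumed.
        if mod_time <= self.max_mod_time - self.offset:
            return False
        # Within the window: read only if this name has not been seen yet.
        if name in self.seen:
            return False
        self.seen[name] = mod_time
        self.max_mod_time = max(self.max_mod_time, mod_time)
        self._prune()
        return True

    def _prune(self):
        # Drop names that have fallen out of the offset window, so the
        # set of remembered names stays bounded.
        cutoff = self.max_mod_time - self.offset
        self.seen = {n: t for n, t in self.seen.items() if t > cutoff}
```

With an offset of 60 seconds, a file whose mod time lags the newest file by less than 60 seconds is still picked up exactly once, while names older than the window are forgotten rather than stored forever.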
One could change this logic to a hybrid approach that keeps the names of all files whose modification timestamp is larger than the max mod time minus an offset. It would be great if you could open a Jira issue for this problem.

Thanks,
Fabian

2018-07-24 14:58 GMT+02:00 Averell <lvhu...@gmail.com>:
> Hello Jörn.
>
> Thanks for your help.
> "/Probably the system is putting them to the folder and Flink is triggered
> before they are consistent./" <<< yes, I also guess so. However, if Flink
> is triggered before they are consistent, either (a) there should be some
> error messages, or (b) Flink should be able to identify those files in the
> subsequent triggers. But in my case, those files are missed forever.
>
> Right now those files on S3 are to be consumed by Flink only. The flow is
> as follows:
> Existing system >>> S3 >>> Flink >>> Elastic Search.
> If I cannot find a solution to the mentioned problem, I might need to
> change to:
> Existing system >>> Kinesis >>> Flink >>> Elastic Search
> Or
> Existing system >>> S3 >>> Kinesis >>> Flink >>> Elastic Search
> Or
> Existing system >>> S3 >>> Custom File Source + Flink >>> Elastic Search
> However, all those solutions would take much more effort.
>
> Thanks!
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/