Re: HDFS streaming source concerns

Adrian Bednarz Tue, 19 Apr 2022 04:48:19 -0700

Hello,

We are actually working on a similar problem against S3. The checkpointing
thing got me thinking if the checkpoint would indeed succeed with a large
backlog of files. I always imagined that SplitEnumerator lists all
available files and SourceReader is responsible for reading those files
afterwards. In this model I thought checkpoint barriers would be blocked
until the initial files backlog is fully consumed.


In Flink UI though I can see that checkpoints are happening timely when
performing a small experiment against local FS. Can somebody comment on how
this works under the hood?

Adrian

On Fri, Apr 8, 2022 at 8:20 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi Carlos,
>
> AFAIK, Flink FileSource is capable of checkpointing while reading the
> files (at least in Streaming Mode).
> As for the watermarks, I think FLIP-182 [1] could solve the problem;
> however, it's currently under development.
>
> I'm also pulling in Arvid and Fabian who are more familiar with the
> subject.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources
>
> Regards,
> Roman
>
> On Wed, Apr 6, 2022 at 4:17 PM Carlos Downey <carlos.downey...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > We have an in-house platform that we want to integrate with external
> clients via HDFS. They have lots of existing files and they continuously
> put more data to HDFS. Ideally, we would like to have a Flink job that
> takes care of ingesting data as one of the requirements is to execute SQL
> on top of these files. We looked at existing FileSource implementation but
> we believe this will not be well suited for this use case.
> > - firstly, we'd have to ingest all files initially present on HDFS
> before completing first checkpoint - this is unacceptable for us as we
> would have to reprocess all the files again in case of early job failure.
> Not to mention the state blowing up for aggregations.
> > - secondly, we see now way to establish valid watermark strategy. This
> is a major pain point that we can't find the right answer for. We don't
> want to assume too much about the data itself. In general, the only
> solutions we see require some sort of synchronization across subtasks. On
> the other hand, the simplest strategy is to delay the watermark. In that
> case though we are afraid of accidentally dropping events.
> >
> > Given this, we think about implementing our own file source, have
> someone in the community already tried solving similar problem? If not, any
> suggestions about the concerns we raised would be valuable.
>

Re: HDFS streaming source concerns

Reply via email to