Re: HDFS streaming source concerns

Roman Khachatryan Fri, 08 Apr 2022 11:20:51 -0700

Hi Carlos,

AFAIK, Flink FileSource is capable of checkpointing while reading the
files (at least in Streaming Mode).
As for the watermarks, I think FLIP-182 [1] could solve the problem;
however, it's currently under development.


I'm also pulling in Arvid and Fabian who are more familiar with the subject.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources

Regards,
Roman

On Wed, Apr 6, 2022 at 4:17 PM Carlos Downey <carlos.downey...@gmail.com> wrote:
>
> Hi,
>
> We have an in-house platform that we want to integrate with external clients 
> via HDFS. They have lots of existing files and they continuously put more 
> data to HDFS. Ideally, we would like to have a Flink job that takes care of 
> ingesting data as one of the requirements is to execute SQL on top of these 
> files. We looked at existing FileSource implementation but we believe this 
> will not be well suited for this use case.
> - firstly, we'd have to ingest all files initially present on HDFS before 
> completing first checkpoint - this is unacceptable for us as we would have to 
> reprocess all the files again in case of early job failure. Not to mention 
> the state blowing up for aggregations.
> - secondly, we see now way to establish valid watermark strategy. This is a 
> major pain point that we can't find the right answer for. We don't want to 
> assume too much about the data itself. In general, the only solutions we see 
> require some sort of synchronization across subtasks. On the other hand, the 
> simplest strategy is to delay the watermark. In that case though we are 
> afraid of accidentally dropping events.
>
> Given this, we think about implementing our own file source, have someone in 
> the community already tried solving similar problem? If not, any suggestions 
> about the concerns we raised would be valuable.

Re: HDFS streaming source concerns

Reply via email to