Hi! This should definitely be possible in Flink. Pretty much exactly like you describe it.
You need a custom version of the HDFS sink with some logic for when to roll over to a new file.

You can also make the sink "exactly once" by integrating it with the checkpointing. For that, you would probably need to keep the current path and output stream offsets as of the last checkpoint, so that you can resume from that offset and overwrite records to avoid duplicates. If that is not possible, you would probably buffer records between checkpoints and only write on checkpoints.

(A rough, untested sketch of what such a sink could look like is below the quoted mail.)

Greetings,
Stephan

On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:

> Hi,
>
> Did anybody think of (mis-)using Flink streaming as an alternative to
> Apache Flume just for ingesting data from Kafka (or other streaming
> sources) into HDFS? Knowing that Flink can read from Kafka and write to
> HDFS, I assume it should be possible, but is this a good idea?
>
> Flume is basically about consuming data from somewhere, peeking into each
> record and then directing it to a specific directory/file in HDFS reliably.
> I've seen there is a FlumeSink, but would it be possible to get the same
> functionality with Flink alone?
>
> I've skimmed through the documentation and found the option to split the
> output by key and the possibility to add multiple sinks. As I understand
> it, Flink programs are generally static, so it would not be possible to
> add/remove sinks at runtime? So you would need to implement a custom sink
> directing the records to different files based on a key (e.g. date)?
> Would it be difficult to implement things like rolling outputs etc., or
> would it be better to just use Flume?
>
> Best,
> Hans-Peter
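For what it's worth, something along these lines could be a starting point. This is an untested sketch, not a finished implementation: the class name RollingHdfsSink, the date-based bucketing and the file layout are just illustrative, and the checkpoint / exactly-once integration described above is left out.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

public class RollingHdfsSink extends RichSinkFunction<String> {

    private final String basePath;          // e.g. "hdfs:///data/ingest" (illustrative)

    private transient FileSystem fs;
    private transient FSDataOutputStream out;
    private transient String currentBucket; // bucket (here: date) of the currently open file

    public RollingHdfsSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // one Hadoop FileSystem handle per parallel sink instance
        fs = FileSystem.get(URI.create(basePath), new org.apache.hadoop.conf.Configuration());
    }

    @Override
    public void invoke(String record) throws Exception {
        // Derive the target bucket from the record; here simply the current date,
        // but it could just as well be parsed out of the record itself.
        String bucket = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        if (!bucket.equals(currentBucket)) {
            rollFile(bucket);
        }
        out.write((record + "\n").getBytes("UTF-8"));
    }

    private void rollFile(String bucket) throws Exception {
        // close the previous file and open a new one for the new bucket
        if (out != null) {
            out.close();
        }
        currentBucket = bucket;
        // one file per bucket and per parallel subtask, to avoid concurrent writers
        Path path = new Path(basePath + "/" + bucket + "/part-"
                + getRuntimeContext().getIndexOfThisSubtask());
        out = fs.create(path, true);
    }

    @Override
    public void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}

You would then plug it in like any other sink, e.g. stream.addSink(new RollingHdfsSink("hdfs:///data/ingest")). A real version would additionally hook into the checkpointing (remembering path and offset, or buffering between checkpoints) to get the exactly-once behavior.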