Hi! This should definitely be possible in Flink. Pretty much exactly like you describe it.
You need a custom version of the HDFS sink with some logic for when to roll over to a new file.

You can also make the sink "exactly once" by integrating it with the checkpointing. For that, you would probably need to keep the current path and output stream offsets as of the last checkpoint, so that you can resume from that offset and overwrite records to avoid duplicates. If that is not possible, you would probably buffer records between checkpoints and only write on checkpoints.

(A rough, untested sketch of what such a sink could look like is below the quoted mail.)

Greetings,
Stephan

On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:

> Hi,
>
> Did anybody think of (mis-)using Flink streaming as an alternative to
> Apache Flume just for ingesting data from Kafka (or other streaming
> sources) into HDFS? Knowing that Flink can read from Kafka and write to
> HDFS, I assume it should be possible, but is this a good idea?
>
> Flume is basically about consuming data from somewhere, peeking into each
> record and then directing it to a specific directory/file in HDFS reliably.
> I've seen there is a FlumeSink, but would it be possible to get the same
> functionality with Flink alone?
>
> I've skimmed through the documentation and found the option to split the
> output by key and the possibility to add multiple sinks. As I understand
> it, Flink programs are generally static, so it would not be possible to
> add/remove sinks at runtime? So you would need to implement a custom sink
> directing the records to different files based on a key (e.g. date)?
> Would it be difficult to implement things like rolling outputs etc., or
> would it be better to just use Flume?
>
> Best,
> Hans-Peter
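For what it's worth, something along these lines could be a starting point. This is an untested sketch, not a finished implementation: the class name RollingHdfsSink, the date-based bucketing and the file layout are just illustrative, and the checkpoint / exactly-once integration described above is left out.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

public class RollingHdfsSink extends RichSinkFunction<String> {

    private final String basePath;          // e.g. "hdfs:///data/ingest" (illustrative)

    private transient FileSystem fs;
    private transient FSDataOutputStream out;
    private transient String currentBucket; // bucket (here: date) of the currently open file

    public RollingHdfsSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // one Hadoop FileSystem handle per parallel sink instance
        fs = FileSystem.get(URI.create(basePath), new org.apache.hadoop.conf.Configuration());
    }

    @Override
    public void invoke(String record) throws Exception {
        // Derive the target bucket from the record; here simply the current date,
        // but it could just as well be parsed out of the record itself.
        String bucket = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        if (!bucket.equals(currentBucket)) {
            rollFile(bucket);
        }
        out.write((record + "\n").getBytes("UTF-8"));
    }

    private void rollFile(String bucket) throws Exception {
        // close the previous file and open a new one for the new bucket
        if (out != null) {
            out.close();
        }
        currentBucket = bucket;
        // one file per bucket and per parallel subtask, to avoid concurrent writers
        Path path = new Path(basePath + "/" + bucket + "/part-"
                + getRuntimeContext().getIndexOfThisSubtask());
        out = fs.create(path, true);
    }

    @Override
    public void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}

You would then plug it in like any other sink, e.g. stream.addSink(new RollingHdfsSink("hdfs:///data/ingest")). A real version would additionally hook into the checkpointing (remembering path and offset, or buffering between checkpoints) to get the exactly-once behavior.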