If you are up for it, this would be a very nice addition to Flink, a great contribution :-)
On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> This should definitely be possible in Flink. Pretty much exactly like you
> describe it.
>
> You need a custom version of the HDFS sink with some logic for when to
> roll over to a new file.
>
> You can also make the sink "exactly once" by integrating it with the
> checkpointing. For that, you would probably need to keep the current path
> and output stream offsets as of the last checkpoint, so you can resume
> from that offset and overwrite records to avoid duplicates. If that is not
> possible, you would probably buffer records between checkpoints and only
> write on checkpoints.
>
> Greetings,
> Stephan
>
>
> On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:
>
>> Hi,
>>
>> Did anybody think of (mis-)using Flink streaming as an alternative to
>> Apache Flume, just for ingesting data from Kafka (or other streaming
>> sources) into HDFS? Knowing that Flink can read from Kafka and write to
>> HDFS, I assume it should be possible, but is it a good idea to do?
>>
>> Flume is basically about consuming data from somewhere, peeking into
>> each record, and then directing it to a specific directory/file in HDFS
>> reliably. I've seen there is a FlumeSink, but would it be possible to get
>> the same functionality with Flink alone?
>>
>> I've skimmed through the documentation and found the option to split the
>> output by key and the possibility to add multiple sinks. As I understand
>> it, Flink programs are generally static, so it would not be possible to
>> add/remove sinks at runtime? So you would need to implement a custom sink
>> directing the records to different files based on a key (e.g. date)?
>> Would it be difficult to implement things like rolling outputs etc., or
>> is it better to just use Flume?
>>
>> Best,
>> Hans-Peter
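
A minimal sketch of the kind of custom rolling HDFS sink discussed above, leaving out the exactly-once checkpoint integration Stephan mentions. The class name RollingHdfsSink, the day-based bucketing, and the path layout are assumptions for illustration; the SinkFunction signature follows the Flink streaming API of that era (invoke(value) without a Context argument).

// Illustrative sketch of a rolling HDFS sink: writes records as lines and rolls
// over to a new file whenever the date bucket changes. Not exactly-once.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.text.SimpleDateFormat;
import java.util.Date;

public class RollingHdfsSink extends RichSinkFunction<String> {

    private final String basePath;            // e.g. "hdfs:///data/ingest" (assumed layout)
    private transient FileSystem fs;
    private transient FSDataOutputStream out;
    private transient String currentBucket;   // date bucket the currently open file belongs to

    public RollingHdfsSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connect to the default Hadoop file system configured on the classpath.
        fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
    }

    @Override
    public void invoke(String record) throws Exception {
        // Derive the target bucket; here simply the current day, but it could
        // just as well be a key extracted from the record itself.
        String bucket = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        if (!bucket.equals(currentBucket)) {
            rollOver(bucket);                  // close the old file, open a new one
        }
        out.write((record + "\n").getBytes("UTF-8"));
    }

    private void rollOver(String bucket) throws Exception {
        if (out != null) {
            out.close();
        }
        currentBucket = bucket;
        // One file per parallel subtask and rollover, to avoid writers clashing.
        Path file = new Path(basePath + "/" + bucket + "/part-"
                + getRuntimeContext().getIndexOfThisSubtask()
                + "-" + System.currentTimeMillis());
        out = fs.create(file);
    }

    @Override
    public void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}

Wired into a job it would simply be the last operator, e.g. kafkaStream.addSink(new RollingHdfsSink("hdfs:///data/ingest")). For the exactly-once variant, the sink would additionally snapshot the current file path and write offset on each checkpoint and, on restore, resume at that offset (or only flush buffered records on checkpoints), as described in the thread.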