I'm thinking about implementing this. After looking into the Flink code, I would basically subclass FileOutputFormat in, let's say, a KeyedFileOutputFormat that gets an additional KeySelector object. The string returned by the KeySelector is then appended to the path in the file system.
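To make this concrete, here is a rough sketch of what I have in mind. One caveat: since FileOutputFormat manages a single output stream per task, the sketch implements OutputFormat directly instead and keeps one stream per key. The base path, the "part-<subtask>" file naming, and the UTF-8 line encoding are just assumptions for illustration:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

/**
 * Writes each record to <basePath>/<key>/part-<subtask>, where the key
 * is derived from the record by a user-supplied KeySelector.
 */
public class KeyedFileOutputFormat<T> implements OutputFormat<T> {

    private final Path basePath;
    private final KeySelector<T, String> keySelector;

    private transient Map<String, FSDataOutputStream> streams;
    private transient int taskNumber;

    public KeyedFileOutputFormat(Path basePath, KeySelector<T, String> keySelector) {
        this.basePath = basePath;
        this.keySelector = keySelector;
    }

    @Override
    public void configure(Configuration parameters) {}

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.streams = new HashMap<>();
    }

    @Override
    public void writeRecord(T record) throws IOException {
        String key;
        try {
            key = keySelector.getKey(record);
        } catch (Exception e) {
            throw new IOException("Could not extract key from record", e);
        }
        FSDataOutputStream out = streams.get(key);
        if (out == null) {
            // one file per key and parallel subtask, e.g. base/2015-08-16/part-3
            Path path = new Path(new Path(basePath, key), "part-" + taskNumber);
            FileSystem fs = path.getFileSystem();
            out = fs.create(path, true);
            streams.put(key, out);
        }
        out.write((record.toString() + "\n").getBytes("UTF-8"));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : streams.values()) {
            out.close();
        }
        streams.clear();
    }
}

The stream program would then hand this to the sink side, e.g. via writeUsingOutputFormat on the DataStream, depending on what the API offers in the version at hand.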
Do you think this is a good approach?

Greets,
Rico

> On 16.08.2015, at 19:56, Stephan Ewen <se...@apache.org> wrote:
>
> If you are up for it, this would be a very nice addition to Flink, a great
> contribution :-)
>
>> On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:
>> Hi!
>>
>> This should definitely be possible in Flink. Pretty much exactly like you
>> describe it.
>>
>> You need a custom version of the HDFS sink with some logic for when to
>> roll over to a new file.
>>
>> You can also make the sink "exactly once" by integrating it with the
>> checkpointing. For that, you would probably need to keep the current path
>> and output stream offset as of the last checkpoint, so you can resume from
>> that offset and overwrite records to avoid duplicates. If that is not
>> possible, you would probably buffer records between checkpoints and only
>> write on checkpoints.
>>
>> Greetings,
>> Stephan
>>
>>
>>> On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:
>>> Hi,
>>>
>>> Did anybody think of (mis-)using Flink streaming as an alternative to
>>> Apache Flume, just for ingesting data from Kafka (or other streaming
>>> sources) into HDFS? Knowing that Flink can read from Kafka and write to
>>> HDFS, I assume it should be possible, but is it a good idea?
>>>
>>> Flume is basically about consuming data from somewhere, peeking into
>>> each record, and then reliably directing it to a specific directory/file
>>> in HDFS. I've seen there is a FlumeSink, but would it be possible to get
>>> the same functionality with Flink alone?
>>>
>>> I've skimmed through the documentation and found the option to split the
>>> output by key and the possibility to add multiple sinks. As I understand
>>> it, Flink programs are generally static, so it would not be possible to
>>> add/remove sinks at runtime?
>>> So you would need to implement a custom sink directing the records to
>>> different files based on a key (e.g. date)? Would it be difficult to
>>> implement things like rolling outputs, etc.? Or is it better to just use
>>> Flume?
>>>
>>> Best,
>>> Hans-Peter
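PS: Regarding the exactly-once idea from Stephan's mail above, this is roughly the bookkeeping I would try, assuming the Checkpointed interface is available for the sink. The class name and the per-key offset map are hypothetical; the actual writing and truncation logic is left out:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Sketch of the state needed for an "exactly once" file sink: on each
 * checkpoint, snapshot the byte offset written so far per output file.
 * On restore, the sink would resume each file at the saved offset and
 * overwrite records written after the last completed checkpoint.
 */
public class ExactlyOnceFileSink<T> extends RichSinkFunction<T>
        implements Checkpointed<HashMap<String, Long>> {

    // current write offset per output file (file path -> bytes written)
    private transient Map<String, Long> currentOffsets;

    @Override
    public void open(Configuration parameters) {
        if (currentOffsets == null) {
            currentOffsets = new HashMap<>();
        }
    }

    @Override
    public void invoke(T value) {
        // write the record to its file, then advance currentOffsets ...
    }

    @Override
    public HashMap<String, Long> snapshotState(long checkpointId, long timestamp) {
        // copy, so later writes do not mutate the checkpointed state
        return new HashMap<>(currentOffsets);
    }

    @Override
    public void restoreState(HashMap<String, Long> state) {
        // reopen each file at the saved offset, discarding anything
        // written after the last completed checkpoint
        currentOffsets = new HashMap<>(state);
    }
}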