I'm thinking about implementing this. After looking into the Flink code, I would basically subclass FileOutputFormat in, let's say, a KeyedFileOutputFormat that gets an additional KeySelector object. The string returned by the KeySelector is then appended to the path in the file system.
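To make this concrete, here is a rough sketch of what I have in mind. One caveat: since FileOutputFormat manages a single output stream per task, the sketch implements OutputFormat directly instead and keeps one stream per key. The base path, the "part-<subtask>" file naming, and the UTF-8 line encoding are just assumptions for illustration:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

/**
 * Writes each record to <basePath>/<key>/part-<subtask>, where the key
 * is derived from the record by a user-supplied KeySelector.
 */
public class KeyedFileOutputFormat<T> implements OutputFormat<T> {

    private final Path basePath;
    private final KeySelector<T, String> keySelector;

    private transient Map<String, FSDataOutputStream> streams;
    private transient int taskNumber;

    public KeyedFileOutputFormat(Path basePath, KeySelector<T, String> keySelector) {
        this.basePath = basePath;
        this.keySelector = keySelector;
    }

    @Override
    public void configure(Configuration parameters) {}

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.streams = new HashMap<>();
    }

    @Override
    public void writeRecord(T record) throws IOException {
        String key;
        try {
            key = keySelector.getKey(record);
        } catch (Exception e) {
            throw new IOException("Could not extract key from record", e);
        }
        FSDataOutputStream out = streams.get(key);
        if (out == null) {
            // one file per key and parallel subtask, e.g. base/2015-08-16/part-3
            Path path = new Path(new Path(basePath, key), "part-" + taskNumber);
            FileSystem fs = path.getFileSystem();
            out = fs.create(path, true);
            streams.put(key, out);
        }
        out.write((record.toString() + "\n").getBytes("UTF-8"));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : streams.values()) {
            out.close();
        }
        streams.clear();
    }
}

The stream program would then hand this to the sink side, e.g. via writeUsingOutputFormat on the DataStream, depending on what the API offers in the version at hand.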
Do you think this is a good approach?

Greets,
Rico

> On 16.08.2015, at 19:56, Stephan Ewen <se...@apache.org> wrote:
>
> If you are up for it, this would be a very nice addition to Flink, a great
> contribution :-)
>
>> On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:
>> Hi!
>>
>> This should definitely be possible in Flink. Pretty much exactly like you
>> describe it.
>>
>> You need a custom version of the HDFS sink with some logic for when to
>> roll over to a new file.
>>
>> You can also make the sink "exactly once" by integrating it with the
>> checkpointing. For that, you would probably need to keep the current path
>> and output stream offset as of the last checkpoint, so you can resume from
>> that offset and overwrite records to avoid duplicates. If that is not
>> possible, you would probably buffer records between checkpoints and only
>> write on checkpoints.
>>
>> Greetings,
>> Stephan
>>
>>
>>> On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:
>>> Hi,
>>>
>>> Did anybody think of (mis-)using Flink streaming as an alternative to
>>> Apache Flume, just for ingesting data from Kafka (or other streaming
>>> sources) into HDFS? Knowing that Flink can read from Kafka and write to
>>> HDFS, I assume it should be possible, but is it a good idea?
>>>
>>> Flume is basically about consuming data from somewhere, peeking into
>>> each record, and then reliably directing it to a specific directory/file
>>> in HDFS. I've seen there is a FlumeSink, but would it be possible to get
>>> the same functionality with Flink alone?
>>>
>>> I've skimmed through the documentation and found the option to split the
>>> output by key and the possibility to add multiple sinks. As I understand
>>> it, Flink programs are generally static, so it would not be possible to
>>> add/remove sinks at runtime?
>>> So you would need to implement a custom sink directing the records to
>>> different files based on a key (e.g. date)? Would it be difficult to
>>> implement things like rolling outputs, etc.? Or is it better to just use
>>> Flume?
>>>
>>> Best,
>>> Hans-Peter
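PS: Regarding the exactly-once idea from Stephan's mail above, this is roughly the bookkeeping I would try, assuming the Checkpointed interface is available for the sink. The class name and the per-key offset map are hypothetical; the actual writing and truncation logic is left out:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Sketch of the state needed for an "exactly once" file sink: on each
 * checkpoint, snapshot the byte offset written so far per output file.
 * On restore, the sink would resume each file at the saved offset and
 * overwrite records written after the last completed checkpoint.
 */
public class ExactlyOnceFileSink<T> extends RichSinkFunction<T>
        implements Checkpointed<HashMap<String, Long>> {

    // current write offset per output file (file path -> bytes written)
    private transient Map<String, Long> currentOffsets;

    @Override
    public void open(Configuration parameters) {
        if (currentOffsets == null) {
            currentOffsets = new HashMap<>();
        }
    }

    @Override
    public void invoke(T value) {
        // write the record to its file, then advance currentOffsets ...
    }

    @Override
    public HashMap<String, Long> snapshotState(long checkpointId, long timestamp) {
        // copy, so later writes do not mutate the checkpointed state
        return new HashMap<>(currentOffsets);
    }

    @Override
    public void restoreState(HashMap<String, Long> state) {
        // reopen each file at the saved offset, discarding anything
        // written after the last completed checkpoint
        currentOffsets = new HashMap<>(state);
    }
}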