If you are up for it, this would be a very nice addition to Flink, a great
contribution :-)

On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> This should definitely be possible in Flink. Pretty much exactly like you
> describe it.
>
> You need a custom version of the HDFS sink with some logic when to roll
> over to a new file.
>
> You can also make the sink "exactly once" by integrating it with the
> checkpointing. For that, you would probably need to keep the current path
> and output stream offsets as of the last checkpoint, so you can resume from
> that offset and overwrite records to avoid duplicates. If that is not
> possible, you would probably buffer records between checkpoints and only
> write on checkpoints.
>
> Greetings,
> Stephan
>
>
>
> On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:
>
>> Hi,
>>
>> Did anybody think of (mis-) using Flink streaming as an alternative to
>> Apache Flume just for ingesting data from Kafka (or other streaming
>> sources) to HDFS? Knowing that Flink can read from Kafka and write to hdfs
>> I assume it should be possible, but Is this a good idea to do?
>>
>> Flume basically is about consuming data from somewhere, peeking into each
>> record and then directing it to a specific directory/file in HDFS reliably.
>> I've seen there is a FlumeSink, but would it be possible to get the same
>> functionality with
>> Flink alone?
>>
>> I've skimmed through the documentation and found the option to split the
>> output by key and the possibility to add multiple sinks. As I understand,
>> Flink programs are generally static, so it would not be possible to
>> add/remove sinks at runtime?
>> So you would need to implement a custom sink directing the records to
>> different files based on a key (e.g. date)? Would it be difficult to
>> implement things like rolling outputs etc? Or better just use Flume?
>>
>> Best,
>> Hans-Peter
>>
>>
>>
>

Reply via email to