BTW: This is becoming a dev discussion, maybe we should move it to that list...

On Thu, Aug 20, 2015 at 12:12 PM, Stephan Ewen <se...@apache.org> wrote:
Yes, one needs exactly a mechanism to seek the output stream back to the last checkpointed position, in order to overwrite duplicates.

I think HDFS is not going to make this easy, there is basically no seek for write. Not sure how to solve this, other than writing to tmp files and copying upon success.

Apache Flume must have solved this issue in some way, it may be worth looking into how they solved it.

On Thu, Aug 20, 2015 at 11:58 AM, Rico Bergmann <i...@ricobergmann.de> wrote:

My ideas for checkpointing:

I think writing to the destination should not depend on the checkpoint mechanism (otherwise the output would never be written to the destination if checkpointing is disabled). Instead I would keep the offsets of written and checkpointed records. When recovering you would then somehow delete or overwrite the records after that offset. (But I don't really know whether this is as simple as I wrote it ;-) )

Regarding the rolling files, I would suggest making the values of the user-defined partitioning function part of the path or file name. Writing records is then basically: extract the partition to write to, then add the record to a queue for this partition. Each queue has an output format assigned to it. On flushing, the output file is opened, the content of the queue is written to it, and then it is closed.

Does this sound reasonable?

On 20.08.2015 at 10:40, Aljoscha Krettek <aljos...@apache.org> wrote:

Yes, this seems like a good approach. We should probably not reuse the KeySelector for this but maybe a more use-case-specific type of function that can create a desired filename from an input object.

This is only the first part, though. The hard bit would be implementing rolling files and also integrating it with Flink's checkpointing mechanism. For integration with checkpointing you could maybe use "staging files": all elements are put into a staging file. And then, when the notification about a completed checkpoint is received, the contents of this file would be moved (or appended) to the actual destination.

Do you have any ideas about the rolling files/checkpointing?

On Thu, 20 Aug 2015 at 09:44, Rico Bergmann <i...@ricobergmann.de> wrote:

I'm thinking about implementing this.

After looking into the Flink code I would basically subclass FileOutputFormat in, let's say, a KeyedFileOutputFormat that gets an additional KeySelector object. The path in the file system is then appended with the string the KeySelector returns.

Do you think this is a good approach?

Greets, Rico

On 16.08.2015 at 19:56, Stephan Ewen <se...@apache.org> wrote:

If you are up for it, this would be a very nice addition to Flink, a great contribution :-)

On Sun, Aug 16, 2015 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:

Hi!

This should definitely be possible in Flink. Pretty much exactly like you describe it.

You need a custom version of the HDFS sink with some logic for when to roll over to a new file.

You can also make the sink "exactly once" by integrating it with the checkpointing. For that, you would probably need to keep the current path and output stream offsets as of the last checkpoint, so you can resume from that offset and overwrite records to avoid duplicates. If that is not possible, you would probably buffer records between checkpoints and only write on checkpoints.

Greetings,
Stephan

On Sun, Aug 16, 2015 at 7:09 PM, Hans-Peter Zorn <hpz...@gmail.com> wrote:

Hi,

Did anybody think of (mis-)using Flink streaming as an alternative to Apache Flume, just for ingesting data from Kafka (or other streaming sources) into HDFS? Knowing that Flink can read from Kafka and write to HDFS, I assume it should be possible, but is it a good idea?

Flume basically is about consuming data from somewhere, peeking into each record and then directing it to a specific directory/file in HDFS reliably. I've seen there is a FlumeSink, but would it be possible to get the same functionality with Flink alone?

I've skimmed through the documentation and found the option to split the output by key and the possibility to add multiple sinks. As I understand it, Flink programs are generally static, so it would not be possible to add/remove sinks at runtime? So you would need to implement a custom sink directing the records to different files based on a key (e.g. date)? Would it be difficult to implement things like rolling outputs etc., or is it better to just use Flume?

Best,
Hans-Peter
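For anyone who wants to experiment with the path-selection idea discussed above (a user-provided function that maps each record to a directory/file, as Rico and Aljoscha sketch), here is a minimal, hypothetical sketch. It is not an existing Flink API: the names BucketingHdfsSink and PathSelector are made up, it ignores rolling and checkpointing entirely, it is written against the 2015-era RichSinkFunction signatures, and it assumes the Hadoop client libraries are on the classpath.

import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketingHdfsSink<T> extends RichSinkFunction<T> {

    /** Maps a record to the bucket (sub-path) it belongs to, e.g. a date string. */
    public interface PathSelector<T> extends Serializable {
        String selectPath(T record);
    }

    private final String basePath;
    private final PathSelector<T> selector;

    private transient FileSystem fs;
    private transient Map<String, FSDataOutputStream> openStreams;

    public BucketingHdfsSink(String basePath, PathSelector<T> selector) {
        this.basePath = basePath;
        this.selector = selector;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
        openStreams = new HashMap<>();
    }

    @Override
    public void invoke(T value) throws Exception {
        String bucket = selector.selectPath(value);
        FSDataOutputStream out = openStreams.get(bucket);
        if (out == null) {
            // One file per bucket and parallel subtask; a rolling policy
            // (size- or age-based) would close and re-create the file here.
            Path file = new Path(basePath + "/" + bucket
                    + "/part-" + getRuntimeContext().getIndexOfThisSubtask());
            out = fs.create(file, true);
            openStreams.put(bucket, out);
        }
        out.write((value.toString() + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws Exception {
        if (openStreams != null) {
            for (FSDataOutputStream out : openStreams.values()) {
                out.close();
            }
        }
    }
}

The per-bucket stream map is where both the rolling policy and the checkpoint hooks discussed in the thread would plug in.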
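Aljoscha's staging-file idea can also be prototyped without committing to any particular Flink checkpoint interface: write records to a staging file, flush and close it when a checkpoint is taken, and only rename it into the final directory once that checkpoint is reported complete; anything still in the staging directory after a failure belongs to an unfinished checkpoint and can be discarded. The class below is an illustrative helper, not Flink code; snapshot/publish/discardPending mark the points where a sink would call into Flink's checkpoint and checkpoint-complete notifications.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingFileWriter {

    private final FileSystem fs;
    private final Path stagingDir;
    private final Path finalDir;

    private FSDataOutputStream current;
    private Path currentStagingFile;

    public StagingFileWriter(Path stagingDir, Path finalDir) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.stagingDir = stagingDir;
        this.finalDir = finalDir;
    }

    /** Called for every record between two checkpoints. */
    public void write(byte[] record) throws IOException {
        if (current == null) {
            currentStagingFile = new Path(stagingDir, "staging-" + System.currentTimeMillis());
            current = fs.create(currentStagingFile, true);
        }
        current.write(record);
    }

    /** Called when a checkpoint is taken: flush and hand back the file belonging to it. */
    public Path snapshot(long checkpointId) throws IOException {
        Path staged = currentStagingFile;
        if (current != null) {
            current.hsync();   // make the data durable before the checkpoint is acknowledged
            current.close();
            current = null;
        }
        currentStagingFile = null;
        return staged;         // may be null if no records arrived since the last checkpoint
    }

    /** Called when the checkpoint is reported as globally complete: publish the data. */
    public void publish(long checkpointId, Path stagedFile) throws IOException {
        if (stagedFile != null) {
            fs.rename(stagedFile, new Path(finalDir, "part-" + checkpointId));
        }
    }

    /** Called on recovery: unpublished staging files belong to checkpoints that never
        completed, so it is safe to throw them away. */
    public void discardPending() throws IOException {
        fs.delete(stagingDir, true);
        fs.mkdirs(stagingDir);
    }
}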
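The offset-based alternative Stephan and Rico discuss (remember how many bytes were written as of the last checkpoint, and on recovery cut the file back to that length so replayed records overwrite the uncommitted tail) depends on being able to shorten an HDFS file. FileSystem.truncate() provides that, but only from Hadoop 2.7 onwards, which is exactly the "no seek for write" limitation mentioned above; on older versions one would instead record a "valid length" and have readers ignore the tail. Again a hypothetical sketch, not an existing API:

import java.io.IOException;
import java.io.Serializable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetTrackingWriter {

    /** What would go into Flink's checkpoint state for this writer. */
    public static class WriterState implements Serializable {
        public final String file;
        public final long validLength;

        public WriterState(String file, long validLength) {
            this.file = file;
            this.validLength = validLength;
        }
    }

    private final FileSystem fs;
    private final Path file;
    private FSDataOutputStream out;

    public OffsetTrackingWriter(Path file) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.file = file;
    }

    public void write(byte[] record) throws IOException {
        if (out == null) {
            out = fs.exists(file) ? fs.append(file) : fs.create(file, false);
        }
        out.write(record);
    }

    /** At checkpoint time: flush and capture how many bytes are known to be good. */
    public WriterState snapshot() throws IOException {
        if (out != null) {
            out.hsync();       // make written bytes durable
            return new WriterState(file.toString(), out.getPos());
        }
        long len = fs.exists(file) ? fs.getFileStatus(file).getLen() : 0L;
        return new WriterState(file.toString(), len);
    }

    /** On recovery: discard everything written after the last successful checkpoint. */
    public void restore(WriterState state) throws IOException {
        Path p = new Path(state.file);
        if (fs.exists(p)) {
            // truncate() may return false if the operation completes asynchronously;
            // a real implementation would wait until the file reaches the target length.
            fs.truncate(p, state.validLength);
        }
    }
}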