I have had the same thought as Dromit several times: a Kafka cluster is an expensive extra cost.

Can someone check these ideas?

Case 1: You don't need replay or exactly-once guarantees:
- All values carry a timestamp.
- The data source inserts the watermark at the source. Some code example? (A sketch follows below.)

Case 2: Your data source is another PC (with a hard disk) or an IoT device (Raspberry Pi + memory card), so the source can easily hold the last 15 minutes of data:
- All values arrive with a timestamp.
- The source inserts the watermark (same sketch below).
- Flink's sinks send watermark feedback back to the data source --> the data source can free the already-processed data on its memory card.

Am I crazy .... :) ?
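By "the source inserts the watermark" I mean roughly the following. This is only a sketch against Flink's SourceFunction API; the SensorReading POJO and the readNext() helper are made-up placeholders for whatever the device actually produces.

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical POJO (its own file in practice); stands in for the device's records.
public class SensorReading {
    public long timestamp; // event time assigned by the device
    public double value;
}

// Sketch of a source that assigns timestamps and emits watermarks itself,
// so event-time processing works without a broker in between.
public class DeviceSource implements SourceFunction<SensorReading> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<SensorReading> ctx) throws Exception {
        while (running) {
            SensorReading reading = readNext(); // made-up helper: poll the device
            synchronized (ctx.getCheckpointLock()) {
                // attach the device's own timestamp to the record
                ctx.collectWithTimestamp(reading, reading.timestamp);
                // emit a watermark right behind it (assumes in-order values)
                ctx.emitWatermark(new Watermark(reading.timestamp));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private SensorReading readNext() {
        // placeholder for the actual device/WebSocket read
        throw new UnsupportedOperationException("device-specific");
    }
}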
2016-11-16 8:23 GMT+01:00 Tzu-Li (Gordon) Tai <tzuli...@apache.org>:

> Hi Matt,
>
> Here's an example of writing a DeserializationSchema for your POJOs: [1].
>
> As for simply writing messages from the WebSocket to Kafka using a Flink
> job, while it is absolutely viable, I would not recommend it, mainly
> because you never know when you might need to temporarily shut down Flink
> jobs (perhaps for a version upgrade).
>
> Shutting down the WebSocket-consuming job would then, of course, lead to
> missing messages during the shutdown time. It would perhaps be simpler to
> have a separate Kafka producer application that directly ingests messages
> from the WebSocket into Kafka. You wouldn't want this application to be
> down at all, so that all messages can safely land in Kafka first. I would
> recommend keeping this part as simple as possible.
>
> From there, like Till explained, your Flink processing pipelines can rely
> on Kafka's replayability to provide exactly-once processing guarantees on
> your data.
>
> Best,
> Gordon
>
> [1] https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/utils/TaxiRideSchema.java
>
>
> On November 16, 2016 at 1:07:12 PM, Dromit (dromitl...@gmail.com) wrote:
>
> "So in your case I would directly ingest my messages into Kafka"
>
> I will do that through a custom SourceFunction that reads the messages
> from the WebSocket, creates simple Java objects (POJOs) and sinks them
> into a Kafka topic using a FlinkKafkaProducer, if that makes sense.
>
> The problem now is that I need a DeserializationSchema for my class. I
> read that Flink is able to de/serialize POJO objects on its own, but I'm
> still required to provide a serializer to create the FlinkKafkaProducer
> (and FlinkKafkaConsumer).
>
> Any idea or example? Should I create a DeserializationSchema for each
> POJO class I want to put into a Kafka stream?
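For that last question, a minimal schema along the lines of [1] could look roughly like this. It is only a sketch: it assumes a hypothetical Event POJO serialized as JSON with Jackson, and Flink's SerializationSchema/DeserializationSchema interfaces (their package has moved between Flink versions, e.g. org.apache.flink.streaming.util.serialization in older releases). One such class per POJO type is enough, and the same instance can be handed to both the FlinkKafkaProducer and the FlinkKafkaConsumer.

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Hypothetical POJO (its own file in practice) built from the WebSocket messages.
public class Event {
    public String symbol;
    public double price;
    public long timestamp;
}

// One schema handles both directions, so producer and consumer can share it.
public class EventSchema implements SerializationSchema<Event>, DeserializationSchema<Event> {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public byte[] serialize(Event element) {
        try {
            return MAPPER.writeValueAsBytes(element);
        } catch (IOException e) {
            throw new RuntimeException("Could not serialize event", e);
        }
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return MAPPER.readValue(message, Event.class);
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        return false; // the Kafka topic is unbounded
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}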
> On Tue, Nov 15, 2016 at 7:43 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Matt,
>>
>> As you've stated, Flink is a stream processor and as such it needs to
>> get its inputs from somewhere. Flink can provide you with up to
>> exactly-once processing guarantees. But in order to do this, it requires
>> a replayable source, because in case of a failure you might have to
>> reprocess parts of the input you had already processed prior to the
>> failure. Kafka is such a source, and people use it because it happens to
>> be one of the most popular and widespread open-source message
>> queues/distributed logs.
>>
>> If you don't require strong processing guarantees, then you can simply
>> use the WebSocket source. But for any serious use case, you probably
>> want to have these guarantees, because otherwise you might just
>> calculate bogus results. So in your case I would directly ingest my
>> messages into Kafka and then let Flink read from the created topic to do
>> the processing.
>>
>> Cheers,
>> Till
>>
>> On Tue, Nov 15, 2016 at 8:14 AM, Dromit <dromitl...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> As far as I've seen, there are a lot of projects using Flink and Kafka
>>> together, but I'm not seeing the point of that. Let me know what you
>>> think about this.
>>>
>>> 1. If I'm not wrong, Kafka provides basically two things: storage
>>> (record retention) and fault tolerance in case of failure, while Flink
>>> mostly cares about the transformation of such records. That means I can
>>> write a pipeline with Flink alone, and even distribute it on a cluster,
>>> but in case of failure some records may be lost, or I won't be able to
>>> reprocess the data if I change the code, since the records are not kept
>>> in Flink by default (only when sinked properly). Is that right?
>>>
>>> 2. In my use case the records come from a WebSocket and I create a
>>> custom class based on the messages on that socket. Should I put those
>>> records into a Kafka topic right away using a Flink custom source
>>> (SourceFunction) with a Kafka sink (FlinkKafkaProducer), and
>>> independently create a Kafka source (FlinkKafkaConsumer) for that topic
>>> and pipe the Flink transformations there? Is that data flow fine?
>>>
>>> Basically, what I'm trying to understand with both questions is how and
>>> why people are using Flink and Kafka.
>>>
>>> Regards,
>>> Matt
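To make the data flow in question 2 concrete, the processing side could look roughly like this. It is just a sketch: it assumes Flink 1.x with the Kafka 0.9 connector (FlinkKafkaConsumer09; the class name varies with the Kafka version), a topic named "events", and the hypothetical Event/EventSchema sketched above. Checkpointing plus Kafka's replayability is what gives the exactly-once guarantees Till describes. The ingestion side would be either a plain Kafka producer application, as Gordon recommends, or a Flink job pairing the custom WebSocket SourceFunction with a FlinkKafkaProducer that uses the same schema.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

public class ProcessingJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing turns Kafka's replayability into exactly-once
        // guarantees for the job's state.
        env.enableCheckpointing(10_000); // every 10 seconds

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.setProperty("group.id", "event-processing");        // hypothetical group id

        // Read the POJOs back out of the topic the ingestion side wrote to.
        DataStream<Event> events = env.addSource(
                new FlinkKafkaConsumer09<>("events", new EventSchema(), props));

        // Transformations go here. On failure, Flink rewinds the Kafka offsets
        // to the last checkpoint and reprocesses, so no records are lost.
        events.print();

        env.execute("Process events from Kafka");
    }
}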