I have had the same thought as Dromit several times: a Kafka cluster is an expensive extra cost.

Can someone check these ideas?

Case 1: You don't need replay or exactly-once guarantees:
- All values carry a timestamp.
- The data source inserts the watermark at the source. Some code example? (A sketch follows below.)

Case 2: Your data source is another PC (with a hard disk) or an IoT device (Raspberry Pi + memory card), so the source can easily hold the last 15 minutes of data:
- All values arrive with a timestamp.
- The source inserts the watermark (same sketch below).
- Flink's sinks send watermark feedback back to the data source --> the data source can free the already-processed data on its memory card.

Am I crazy .... :) ?
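By "the source inserts the watermark" I mean roughly the following. This is only a sketch against Flink's SourceFunction API; the SensorReading POJO and the readNext() helper are made-up placeholders for whatever the device actually produces.

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical POJO (its own file in practice); stands in for the device's records.
public class SensorReading {
    public long timestamp; // event time assigned by the device
    public double value;
}

// Sketch of a source that assigns timestamps and emits watermarks itself,
// so event-time processing works without a broker in between.
public class DeviceSource implements SourceFunction<SensorReading> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<SensorReading> ctx) throws Exception {
        while (running) {
            SensorReading reading = readNext(); // made-up helper: poll the device
            synchronized (ctx.getCheckpointLock()) {
                // attach the device's own timestamp to the record
                ctx.collectWithTimestamp(reading, reading.timestamp);
                // emit a watermark right behind it (assumes in-order values)
                ctx.emitWatermark(new Watermark(reading.timestamp));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private SensorReading readNext() {
        // placeholder for the actual device/WebSocket read
        throw new UnsupportedOperationException("device-specific");
    }
}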
2016-11-16 8:23 GMT+01:00 Tzu-Li (Gordon) Tai <tzuli...@apache.org>:

> Hi Matt,
>
> Here's an example of writing a DeserializationSchema for your POJOs: [1].
>
> As for simply writing messages from the WebSocket to Kafka using a Flink
> job, while it is absolutely viable, I would not recommend it, mainly
> because you never know when you might need to temporarily shut down Flink
> jobs (perhaps for a version upgrade).
>
> Shutting down the WebSocket-consuming job would then, of course, lead to
> missing messages during the shutdown time. It would perhaps be simpler to
> have a separate Kafka producer application that directly ingests messages
> from the WebSocket into Kafka. You wouldn't want this application to be
> down at all, so that all messages can safely land in Kafka first. I would
> recommend keeping this part as simple as possible.
>
> From there, like Till explained, your Flink processing pipelines can rely
> on Kafka's replayability to provide exactly-once processing guarantees on
> your data.
>
> Best,
> Gordon
>
> [1] https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/utils/TaxiRideSchema.java
>
>
> On November 16, 2016 at 1:07:12 PM, Dromit (dromitl...@gmail.com) wrote:
>
> "So in your case I would directly ingest my messages into Kafka"
>
> I will do that through a custom SourceFunction that reads the messages
> from the WebSocket, creates simple Java objects (POJOs) and sinks them
> into a Kafka topic using a FlinkKafkaProducer, if that makes sense.
>
> The problem now is that I need a DeserializationSchema for my class. I
> read that Flink is able to de/serialize POJO objects on its own, but I'm
> still required to provide a serializer to create the FlinkKafkaProducer
> (and FlinkKafkaConsumer).
>
> Any idea or example? Should I create a DeserializationSchema for each
> POJO class I want to put into a Kafka stream?
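For that last question, a minimal schema along the lines of [1] could look roughly like this. It is only a sketch: it assumes a hypothetical Event POJO serialized as JSON with Jackson, and Flink's SerializationSchema/DeserializationSchema interfaces (their package has moved between Flink versions, e.g. org.apache.flink.streaming.util.serialization in older releases). One such class per POJO type is enough, and the same instance can be handed to both the FlinkKafkaProducer and the FlinkKafkaConsumer.

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Hypothetical POJO (its own file in practice) built from the WebSocket messages.
public class Event {
    public String symbol;
    public double price;
    public long timestamp;
}

// One schema handles both directions, so producer and consumer can share it.
public class EventSchema implements SerializationSchema<Event>, DeserializationSchema<Event> {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public byte[] serialize(Event element) {
        try {
            return MAPPER.writeValueAsBytes(element);
        } catch (IOException e) {
            throw new RuntimeException("Could not serialize event", e);
        }
    }

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return MAPPER.readValue(message, Event.class);
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        return false; // the Kafka topic is unbounded
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}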
> On Tue, Nov 15, 2016 at 7:43 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Matt,
>>
>> As you've stated, Flink is a stream processor and as such it needs to
>> get its inputs from somewhere. Flink can provide you with up to
>> exactly-once processing guarantees. But in order to do this, it requires
>> a replayable source, because in case of a failure you might have to
>> reprocess parts of the input you had already processed prior to the
>> failure. Kafka is such a source, and people use it because it happens to
>> be one of the most popular and widespread open-source message
>> queues/distributed logs.
>>
>> If you don't require strong processing guarantees, then you can simply
>> use the WebSocket source. But for any serious use case, you probably
>> want to have these guarantees, because otherwise you might just
>> calculate bogus results. So in your case I would directly ingest my
>> messages into Kafka and then let Flink read from the created topic to do
>> the processing.
>>
>> Cheers,
>> Till
>>
>> On Tue, Nov 15, 2016 at 8:14 AM, Dromit <dromitl...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> As far as I've seen, there are a lot of projects using Flink and Kafka
>>> together, but I'm not seeing the point of that. Let me know what you
>>> think about this.
>>>
>>> 1. If I'm not wrong, Kafka provides basically two things: storage
>>> (record retention) and fault tolerance in case of failure, while Flink
>>> mostly cares about the transformation of such records. That means I can
>>> write a pipeline with Flink alone, and even distribute it on a cluster,
>>> but in case of failure some records may be lost, or I won't be able to
>>> reprocess the data if I change the code, since the records are not kept
>>> in Flink by default (only when sinked properly). Is that right?
>>>
>>> 2. In my use case the records come from a WebSocket and I create a
>>> custom class based on the messages on that socket. Should I put those
>>> records into a Kafka topic right away using a Flink custom source
>>> (SourceFunction) with a Kafka sink (FlinkKafkaProducer), and
>>> independently create a Kafka source (FlinkKafkaConsumer) for that topic
>>> and pipe the Flink transformations there? Is that data flow fine?
>>>
>>> Basically, what I'm trying to understand with both questions is how and
>>> why people are using Flink and Kafka.
>>>
>>> Regards,
>>> Matt
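To make the data flow in question 2 concrete, the processing side could look roughly like this. It is just a sketch: it assumes Flink 1.x with the Kafka 0.9 connector (FlinkKafkaConsumer09; the class name varies with the Kafka version), a topic named "events", and the hypothetical Event/EventSchema sketched above. Checkpointing plus Kafka's replayability is what gives the exactly-once guarantees Till describes. The ingestion side would be either a plain Kafka producer application, as Gordon recommends, or a Flink job pairing the custom WebSocket SourceFunction with a FlinkKafkaProducer that uses the same schema.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

public class ProcessingJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing turns Kafka's replayability into exactly-once
        // guarantees for the job's state.
        env.enableCheckpointing(10_000); // every 10 seconds

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.setProperty("group.id", "event-processing");        // hypothetical group id

        // Read the POJOs back out of the topic the ingestion side wrote to.
        DataStream<Event> events = env.addSource(
                new FlinkKafkaConsumer09<>("events", new EventSchema(), props));

        // Transformations go here. On failure, Flink rewinds the Kafka offsets
        // to the last checkpoint and reprocesses, so no records are lost.
        events.print();

        env.execute("Process events from Kafka");
    }
}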