Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
True, flatMap does not have access to watermarks. You can also go a bit lower level and directly implement an AbstractStreamOperator with the OneInputStreamOperator interface. This is more or less the base class for the built-in stream operators, and it has access to watermarks (OneInputStreamOper
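
A minimal sketch of the operator approach Fabian describes, assuming the Flink 1.x Java operator API (the name BatchingOperator and the batch-size logic are hypothetical illustration, not code from the thread):

    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
    import java.util.ArrayList;
    import java.util.List;

    // Buffers elements into fixed-size batches; unlike a flatMap,
    // the operator is also notified of watermarks.
    public class BatchingOperator<T> extends AbstractStreamOperator<List<T>>
            implements OneInputStreamOperator<T, List<T>> {

        private final int batchSize;
        private transient List<T> buffer;

        public BatchingOperator(int batchSize) {
            this.batchSize = batchSize;
        }

        @Override
        public void open() throws Exception {
            super.open();
            buffer = new ArrayList<>(batchSize);
        }

        @Override
        public void processElement(StreamRecord<T> element) throws Exception {
            buffer.add(element.getValue());
            if (buffer.size() >= batchSize) {
                // Emit the full batch downstream.
                output.collect(new StreamRecord<>(new ArrayList<>(buffer)));
                buffer.clear();
            }
        }

        @Override
        public void processWatermark(Watermark mark) throws Exception {
            // This is the hook a flatMap does not have: the operator sees
            // every watermark here; forward it downstream unchanged.
            output.emitWatermark(mark);
        }
    }

It would be wired into a pipeline with DataStream#transform(...), passing a name, a TypeInformation for List<T>, and an instance of the operator.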

Re: General Data questions - streams vs batch

2016-04-28 Thread Konstantin Kulagin
Thanks Fabian, works like a charm except in the case when the stream is finite (or I have a dataset from the beginning). In this case I need to somehow identify that the stream is finished and emit the latest batch (which might have fewer elements) to the output. What is the best way to do that? In stream
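
With the custom operator from Fabian's follow-up above, one way to handle this is to flush the remaining buffer in close(), which the 1.x operator API calls once a finite input is exhausted. A hedged sketch extending the BatchingOperator above (not code from the thread):

    @Override
    public void close() throws Exception {
        // Called after the last element of a finite stream: emit the
        // final, possibly smaller batch before shutting down.
        if (!buffer.isEmpty()) {
            output.collect(new StreamRecord<>(new ArrayList<>(buffer)));
            buffer.clear();
        }
        super.close();
    }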

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
Hi Konstantin, if you do not need a deterministic grouping of elements you should not use a keyed stream or window. Instead, you can do the lookups in a parallel flatMap function. The function would collect arriving elements and perform a lookup query after a certain number of elements have arrived (can
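
A minimal sketch of what Fabian describes, assuming the Flink 1.x Java API (BATCH_SIZE and the lookup() helper are hypothetical placeholders for the external query):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;
    import java.util.ArrayList;
    import java.util.List;

    // Runs in parallel without any keying; each parallel instance
    // buffers elements and issues one lookup query per batch.
    public class BatchedLookupFlatMap extends RichFlatMapFunction<String, String> {

        private static final int BATCH_SIZE = 100;
        private transient List<String> buffer;

        @Override
        public void open(Configuration parameters) {
            buffer = new ArrayList<>(BATCH_SIZE);
        }

        @Override
        public void flatMap(String value, Collector<String> out) {
            buffer.add(value);
            if (buffer.size() >= BATCH_SIZE) {
                // One query for the whole batch instead of one per element.
                for (String result : lookup(buffer)) {
                    out.collect(result);
                }
                buffer.clear();
            }
        }

        // Hypothetical stand-in for the external lookup query.
        private List<String> lookup(List<String> keys) {
            return new ArrayList<>(keys);
        }
    }

Note that flatMap's close() receives no Collector, so a batch still buffered when a finite stream ends cannot be emitted from there; that is exactly the limitation Konstantin runs into in his follow-up above.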

Re: General Data questions - streams vs batch

2016-04-25 Thread Konstantin Kulagin
As usual, thanks for the answers, Aljoscha! I think I understood what I wanted to know. 1) To add a few comments: about streams, I was thinking about something similar to Storm, where you can have one Source and 'duplicate' the same entry going through different 'paths'. Something like this: https://docs.
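
For reference, the Storm-style fan-out mentioned here looks roughly like this in Flink's DataStream API: applying several operators to the same DataStream sends every element down each path (a minimal, hypothetical sketch):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FanOutExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> source = env.fromElements("a", "bb", "ccc");

            // Each element of 'source' flows through both paths.
            source.filter(s -> s.length() > 1).print();   // path 1
            source.map(String::toUpperCase).print();      // path 2

            env.execute("fan-out example");
        }
    }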

Re: General Data questions - streams vs batch

2016-04-25 Thread Aljoscha Krettek
Hi, I'll try to answer your questions separately. First, a general remark: although Flink has the DataSet API for batch processing and the DataStream API for stream processing, we only have one underlying streaming execution engine that is used for both. Now, regarding the questions: 1) What do yo
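
A minimal illustration of that remark, assuming the 2016-era Java APIs: the same transformation written against both APIs, with both programs executed by the one streaming engine. Even the user function interface (MapFunction) is shared:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TwoApis {
        public static void main(String[] args) throws Exception {
            // DataSet API: finite input, batch-style program.
            ExecutionEnvironment batch = ExecutionEnvironment.getExecutionEnvironment();
            batch.fromElements(1, 2, 3)
                 .map(new Doubler())
                 .print();  // print() triggers execution of the batch job

            // DataStream API: the same transformation expressed on a stream.
            StreamExecutionEnvironment streaming =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            streaming.fromElements(1, 2, 3)
                     .map(new Doubler())
                     .print();
            streaming.execute("streaming job");
        }

        // Both APIs accept the same MapFunction interface.
        public static class Doubler implements MapFunction<Integer, Integer> {
            @Override
            public Integer map(Integer value) {
                return value * 2;
            }
        }
    }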

General Data questions - streams vs batch

2016-04-24 Thread Konstantin Kulagin
Hi guys, I have a somewhat general question in order to get a better understanding of stream vs. batch data transformation. More specifically, I am trying to understand the lifecycle of 'entities' during processing. 1) For example, in the case of streams: suppose we start with some key-value source, parallel it
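
To make the setup in 1) concrete, here is a hedged sketch of such a pipeline in the Flink 1.x Java API (the data, key field, and parallelism are made up for illustration):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KeyedPipeline {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(
                    Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
               .keyBy(0)           // partition the stream by the key field f0
               .sum(1)             // a per-key transformation on the value field
               .setParallelism(4)  // run that operator as 4 parallel instances
               .print();

            env.execute("keyed pipeline");
        }
    }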