As usual - thanks for the answers, Aljoscha! I think I understood what I wanted to know.
1) To add a few comments: about streams, I was thinking about something similar to Storm, where you can have one Source and 'duplicate' the same entry so it goes through different paths. Something like this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_storm-user-guide/content/figures/1/figures/SpoutsAndBolts.png
And later you can 'join' these separate streams back. I think this is actually what I meant: https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/JoinedStreams.html - that one 'joins' by window.
As for the exactly-once guarantee, I got the difference from this paper: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink - thanks!

2) Understood, thank you very much.

I'll probably bother you one more time with another question:

3) Let's say I have a Source which provides raw (i.e. non-keyed) data, and I need to 'enhance' each entry with some fields which I can take from a database. So I define some DbEnhanceOperation. A database query might be expensive, so I would want to a) batch entries before querying and b) be able to run several parallel DbEnhanceOperations so they do not slow down my whole processing.

I do not see a way to do that. Problems: I cannot go with countWindowAll because of b) - that operation does not run in parallel (correct?). So I need a keyed windowed stream, and for that I need some key - correct? I.e. I cannot create count-based windows on a stream of arbitrary objects just by counting them.

I probably can 'emulate' a keyed stream by providing some 'fake' key, but in that case I can only parallelize across different keys. Again, it is probably doable by introducing some AtomicLong key generator at the very first step (this part is probably hard to follow - I can share details if necessary), but it still looks like a bit of a hack :) (I put a rough sketch of what I mean at the very bottom of this mail, below the quoted thread.)

But the general question is: can I implement 3) 'normally', in a Flink way?

Thanks!
Konstantin.

On Mon, Apr 25, 2016 at 10:53 AM, Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi,
> I'll try and answer your questions separately. First, a general remark: although Flink has the DataSet API for batch processing and the DataStream API for stream processing, we only have one underlying streaming execution engine that is used for both. Now, regarding the questions:
>
> 1) What do you mean by "parallel into 2 streams"? Maybe that could influence my answer, but I'll just give a general answer: Flink does not give any guarantees about the ordering of elements in a Stream or in a DataSet. This means that merging or unioning two streams/data sets will just mean that operations see all elements in the two merged streams, but the order in which we see them is arbitrary. This means that we don't keep buffers based on time or size or anything.
>
> 2) The elements that flow through the topology are not tracked individually, each operation just receives elements, updates state and sends elements to the downstream operation. In essence this means that elements themselves don't block any resources except if they alter some kept state in operations. If you have a stateless pipeline that only has filters/maps/flatMaps then the amount of required resources is very low.
>
> For a finite data set, elements are also streamed through the topology.
> Only if you use operations that require grouping or sorting (such as groupBy/reduce and join) will elements be buffered in memory or on disk before they are processed.
>
> To answer your last question: if you only do stateless transformations/filters then you are fine to use either API and the performance should be similar.
>
> Cheers,
> Aljoscha
>
> On Sun, 24 Apr 2016 at 15:54 Konstantin Kulagin <kkula...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I have a somewhat general question to get a better understanding of stream vs. finite data transformation. More specifically, I am trying to understand the lifecycle of 'entities' during processing.
>>
>> 1) For example, in the case of streams: suppose we start with some key-value source and parallel it into 2 streams by key. Each stream modifies an entry's values, let's say adds some fields, and we want to merge them back later. How does that happen? Will the merging point keep some finite buffer of entries? Based on time or size?
>>
>> I understand that the right solution in this case would probably be to have one stream and achieve more performance by increasing parallelism, but what if I have 2 sources from the beginning?
>>
>> 2) Also, I assume that in the case of streaming each entry is considered 'processed' once it passes the whole chain and is emitted into some sink, so afterwards it will not consume resources. Basically similar to what Storm is doing. But in the case of finite data (data sets): how much data will the system keep in memory? The whole set?
>>
>> I probably have an example of a dataset vs. stream 'mix': I need to *transform* a big but finite chunk of data. I don't really need to do any 'joins', grouping or anything like that, so I never need to store the whole dataset in memory/storage. What would my choice be in this case?
>>
>> Thanks!
>> Konstantin
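P.S. To make 3) more concrete, here is a minimal, untested sketch of the 'fake key' workaround I described above. All names (FakeKeyBatchEnrichment, the stand-in source, the stubbed lookup) are made up, and I am not claiming this is the proper Flink way - that is exactly what I am asking:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class FakeKeyBatchEnrichment {

  static final int DB_PARALLELISM = 4; // how many parallel "DbEnhanceOperations" I want
  static final int BATCH_SIZE = 100;   // entries per database query

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // stand-in for my real, non-keyed source
    DataStream<String> raw = env.fromElements("entry-1", "entry-2", "entry-3");

    DataStream<String> enriched = raw
        // attach a round-robin "fake" key (the AtomicLong idea from above)
        .map(new MapFunction<String, Tuple2<Long, String>>() {
          private long counter = 0;
          @Override
          public Tuple2<Long, String> map(String value) {
            return new Tuple2<Long, String>(counter++ % DB_PARALLELISM, value);
          }
        })
        // key by the fake key so the windows can run in parallel
        .keyBy(new KeySelector<Tuple2<Long, String>, Long>() {
          @Override
          public Long getKey(Tuple2<Long, String> value) {
            return value.f0;
          }
        })
        // collect BATCH_SIZE entries per fake key before touching the database
        .countWindow(BATCH_SIZE)
        .apply(new WindowFunction<Tuple2<Long, String>, String, Long, GlobalWindow>() {
          @Override
          public void apply(Long key, GlobalWindow window,
                            Iterable<Tuple2<Long, String>> input,
                            Collector<String> out) {
            List<String> batch = new ArrayList<String>();
            for (Tuple2<Long, String> entry : input) {
              batch.add(entry.f1);
            }
            // one (expensive) query for the whole batch -- stubbed out here
            for (String entry : batch) {
              out.collect(entry + " + fields from DB");
            }
          }
        });

    enriched.print();
    env.execute("fake-key batching sketch");
  }
}

Two things already make this feel hacky to me: a count window only fires once the batch is full, so a trailing partial batch is never emitted, and the fake keys are distributed to parallel instances by hash, so there is no guarantee that 4 keys actually land on 4 different instances.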