Re: Thoughts About Streaming

2015-07-23 Thread Aljoscha Krettek
Yes, this can be changed. After all, this is only a design document and meant to be discussed here and then changed. :D For 1) IMHO an ordering does not make sense if you don't set a partitioning (what keyBy() basically does) because elements can be in arbitrary partitions and then the sorting is

Re: Thoughts About Streaming

2015-07-23 Thread Matthias J. Sax
Thanks for clarifying. For map/flapMap it makes sense. However, I would handle filter differently. Even if the re-grouping can be removed in an optimization step internally -- what might also be possible for map/flatMap -- I think it is annoying for the user to specify .byKey() twice... -> stream

Re: Thoughts About Streaming

2015-07-23 Thread Aljoscha Krettek
Hi, I'll try to answer these one at a time: 1) After a map/flatMap the key (and partitioning) of the data is lost, that's why it goes back to a vanilla DataStream. I think filter also has this behavior just to fit in with the other ones. Also, any chain of filter/map/flatMap can also be expressed

Re: Thoughts About Streaming

2015-07-23 Thread Matthias J. Sax
I just had a look into the "Streams+and+Operations+on+Streams" document. The initial figure is confusing... (it makes sense after reading the document but is a bumper in the beginning) A few comments/question: 1) Why is a (Ordered)KeyedDataStream converted into a DataStream if map/flatMap/filte

Re: Thoughts About Streaming

2015-07-23 Thread Aljoscha Krettek
Sure, good thing is that the discretization part is quite orthogonal to the rest. :D On Thu, 23 Jul 2015 at 10:58 Gyula Fóra wrote: > I think aside from the Discretization part we reached a consensus. I think > you can start with the implementation for the rest. > > I will do some updates to the

Re: Thoughts About Streaming

2015-07-23 Thread Gyula Fóra
I think aside from the Discretization part we reached a consensus. I think you can start with the implementation for the rest. I will do some updates to the Discretization part, and might even start a new doc if it gets too long. Gyula Aljoscha Krettek ezt írta (időpont: 2015. júl. 23., Cs, 10:

Re: Thoughts About Streaming

2015-07-23 Thread Aljoscha Krettek
What's the status of the discussion? What are the opinions on the reworking of the Streaming API as presented here: https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams If we could reach a consensus I would like to start working on this to have it done before the nex

Re: Thoughts About Streaming

2015-06-29 Thread Stephan Ewen
@Matthias: I think using the KeyedDataStream will simply result in smaller programs. May be hard for some users to make the connection to a 1-element-tumbling-window, simply because they want to use state. Not everyone is a deep into that stuff as you are ;-) On Sun, Jun 28, 2015 at 1:13 AM, Mat

Re: Thoughts About Streaming

2015-06-27 Thread Matthias J. Sax
Yes. But as I said, you can get the same behavior with a GroupedDataStream using a tumbling 1-tuple-size window. Thus, there is no conceptual advantage in using KeyedDataStream and no disadvantage in binding stateful operations to GroupedDataStreams. On 06/27/2015 06:54 PM, Márton Balassi wrote: >

Re: Thoughts About Streaming

2015-06-27 Thread Márton Balassi
@Matthias: Your point of working with a minimal number of clear concepts is desirable to say the least. :) The reasoning behind the KeyedDatastream is to associate Flink persisted operator state with the keys of the data that produced it, so that stateful computation becomes scalabe in the future.

Re: Thoughts About Streaming

2015-06-27 Thread Matthias J. Sax
This was more a conceptual point-of-view argument. From an implementation point of view, skipping the window building step is a good idea if a tumbling 1-tuple-size window is detected. I prefer to work with a minimum number of concepts (and apply internal optimization if possible) instead of using

Re: Thoughts About Streaming

2015-06-27 Thread Aljoscha Krettek
What do you mean by Comment 2? Using the whole window apparatus if you just want to have, for example, a simple partitioned filter with partitioned state seems a bit extravagant. On Sat, 27 Jun 2015 at 15:19 Matthias J. Sax wrote: > Nice starting point. > > Comment 1: > "Each individual stream p

Re: Thoughts About Streaming

2015-06-27 Thread Matthias J. Sax
Nice starting point. Comment 1: "Each individual stream partition delivers elements strictly in order." (in 'Parallel Streams, Partitions, Time, and Ordering') I would say "FIFO" and not "strictly in order". If data is not emitted in-order, the stream partition will not be in-order either. Comme

Re: Thoughts About Streaming

2015-06-26 Thread Aljoscha Krettek
I like the bit about the API a lot. What I don't see yet is how delta window can work in a distributed way with out-of-order elements. On Fri, 26 Jun 2015 at 19:43 Stephan Ewen wrote: > Here is a first bit of what I have been writing down. Will add more over > the next days: > > > https://cwiki.

Re: Thoughts About Streaming

2015-06-26 Thread Stephan Ewen
Here is a first bit of what I have been writing down. Will add more over the next days: https://cwiki.apache.org/confluence/display/FLINK/Stream+Windows https://cwiki.apache.org/confluence/display/FLINK/Parallel+Streams%2C+Partitions%2C+Time%2C+and+Ordering On Thu, Jun 25, 2015 at 6:35 PM, Pa

Re: Thoughts About Streaming

2015-06-25 Thread Paris Carbone
+1 for writing this down > On 25 Jun 2015, at 18:11, Aljoscha Krettek wrote: > > +1 go ahead > > On Thu, 25 Jun 2015 at 18:02 Stephan Ewen wrote: > >> Hey! >> >> This thread covers many different topics. Lets break this up into separate >> discussions. >> >> - Operator State is already driv

Re: Thoughts About Streaming

2015-06-25 Thread Aljoscha Krettek
+1 go ahead On Thu, 25 Jun 2015 at 18:02 Stephan Ewen wrote: > Hey! > > This thread covers many different topics. Lets break this up into separate > discussions. > > - Operator State is already driven by Gyula and Paris and happening on the > above mentioned pull request and the followup discus

Re: Thoughts About Streaming

2015-06-25 Thread Stephan Ewen
Hey! This thread covers many different topics. Lets break this up into separate discussions. - Operator State is already driven by Gyula and Paris and happening on the above mentioned pull request and the followup discussions. - For windowing, this discussion has brought some results that we s

Re: Thoughts About Streaming

2015-06-25 Thread Matthias J. Sax
Sure. I picked this up. Using the current model for "occurrence time semantics" does not work. I elaborated on this in the past many times (but nobody cared). It is important to make it clear to the user what semantics are supported. Claiming to support "sliding windows" doesn't mean anything; the

Re: Thoughts About Streaming

2015-06-25 Thread Aljoscha Krettek
Yes, I am aware of this requirement and it would also be supported in my proposed model. The problem is, that the "custom timestamp" feature gives the impression that the elements would be windowed according to a user-timestamp. The results, however, are wrong because of the assumption about eleme

Re: Thoughts About Streaming

2015-06-25 Thread Matthias J. Sax
Hi Aljoscha, I like that you are pushing in this direction. However, IMHO you misinterpreter the current approach. It does not assume that tuples arrive in-order; the current approach has no notion about a pre-defined-order (for example, the order in which the event are created). There is only the

Re: Thoughts About Streaming

2015-06-25 Thread Aljoscha Krettek
Yes, now this also processes about 3 mio Elements (Window Size 5 sec, Slide 1 sec) but it still fluctuates a lot between 1 mio. and 5 mio. Performance is not my main concern, however. My concern is that the current model assumes elements to arrive in order, which is simply not true. In your code

Re: Thoughts About Streaming

2015-06-25 Thread Gábor Gévay
I'm very sorry, I had a bug in the InversePreReducer. It should be fixed now. Can you please run it again? I also tried to reproduce some of your performance numbers, but I'm getting only less than 1/10th of yours. For example, in the Tumbling case, Current/Reduce produces only ~10 for me. Do

Re: Thoughts About Streaming

2015-06-25 Thread Aljoscha Krettek
Hi, I also ran the tests on top of PR 856 (inverse reducer) now. The results seem incorrect. When I insert a Thread.sleep(1) in the tuple source, all the previous tests reported around 3600 tuples (Size 5 sec, Slide 1 sec) (Theoretically there would be 5000 tuples in 5 seconds but this is due to ov

Re: Thoughts About Streaming

2015-06-25 Thread Gábor Gévay
Hello, Aljoscha, can you please try the performance test of Current/Reduce with the InversePreReducer in PR 856? (If you just call sum, it will use an InversePreReducer.) It would be an interesting test, because the inverse function optimization really depends on the stream being ordered, and I th

Re: Thoughts About Streaming

2015-06-25 Thread Ufuk Celebi
Thanks for writing this up and comparing to the current implementation. It's great to see that your mockup indicates correct/expected behaviour *and* better performance. :-) Regarding the results for the current mechanism: does this problem affects all window operators? – Ufuk On 25 Jun 2015,

Re: Thoughts About Streaming

2015-06-25 Thread Aljoscha Krettek
I think I'll have to elaborate a bit so I created a proof-of-concept implementation of my Ideas and ran some throughput measurements to alleviate concerns about performance. First, though, I want to highlight again why the current approach does not work with out-of-order elements (which, again, oc

Re: Thoughts About Streaming

2015-06-24 Thread Gyula Fóra
I agree lets separate these topics from each other so we can get faster resolution. There is already a state discussion in the thread we started with Paris. On Wed, Jun 24, 2015 at 9:24 AM Kostas Tzoumas wrote: > I agree with supporting out-of-order out of the box :-), even if this means > a ma

Re: Thoughts About Streaming

2015-06-24 Thread Kostas Tzoumas
I agree with supporting out-of-order out of the box :-), even if this means a major refactoring. This is the right time to refactor the streaming API before we pull it out of beta. I think that this is more important than new features in the streaming API, which can be prioritized once the API is o

Re: Thoughts About Streaming

2015-06-23 Thread Ted Dunning
Out of order is ubiquitous in the real-world. Typically, what happens is that businesses will declare a maximum allowable delay for delayed transactions and will commit to results when that delay is reached. Transactions that arrive later than this cutoff are collected specially as corrections whi

Re: Thoughts About Streaming

2015-06-23 Thread Aljoscha Krettek
I also don't like big changes but sometimes they are necessary. The reason why I'm so adamant about out-of-order processing is that out-of-order elements are not some exception that occurs once in a while; they occur constantly in a distributed system. For example, in this: https://gist.github.com/

Re: Thoughts About Streaming

2015-06-23 Thread Stephan Ewen
What I like a lot about Aljoscha's proposed design is that we need no different code for "system time" vs. "event time". It only differs in where the timestamps are assigned. The OOP approach also gives you the semantics of total ordering without imposing merges on the streams. On Tue, Jun 23, 20

Re: Thoughts About Streaming

2015-06-23 Thread Matthias J. Sax
I agree that there should be multiple alternatives the user(!) can choose from. Partial out-of-order processing works for many/most aggregates. However, if you consider Event-Pattern-Matching, global ordering in necessary (even if the performance penalty might be high). I would also keep "system-t

Re: Thoughts About Streaming

2015-06-23 Thread Gyula Fóra
Hey I think we should not block PRs unnecessarily if your suggested changes might touch them at some point. Also I still think we should not put everything in the Datastream because it will be a huge mess. Also we need to agree on the out of order processing, whether we want it the way you propo

Re: Thoughts About Streaming

2015-06-23 Thread Stephan Ewen
For the windowing designs, we should also have in mind what requirements we have on the way we keep/store the elements (in external stores, Flink managed memory, ...) On Tue, Jun 23, 2015 at 9:55 AM, Aljoscha Krettek wrote: > The reason I posted this now is that we need to think about the API an

Re: Thoughts About Streaming

2015-06-23 Thread Aljoscha Krettek
The reason I posted this now is that we need to think about the API and windowing before proceeding with the PRs of Gabor (inverse reduce) and Gyula (removal of "aggregate" functions on DataStream). For the windowing, I think that the current model does not work for out-of-order processing. Theref

Re: Thoughts About Streaming

2015-06-22 Thread Gyula Fóra
Hi Aljoscha, Thanks for the nice summary, this is a very good initiative. I added some comments to the respective sections (where I didnt fully agree :).). At some point I think it would be good to have a public hangout session on this, which could make a more dynamic discussion. Cheers, Gyula

Thoughts About Streaming

2015-06-22 Thread Aljoscha Krettek
Hi, with people proposing changes to the streaming part I also wanted to throw my hat into the ring. :D During the last few months, while I was getting acquainted with the streaming system, I wrote down some thoughts I had about how things could be improved. Hopefully, they are in somewhat coheren