Yes, this can be changed. After all, this is only a design document and
meant to be discussed here and then changed. :D
For 1) IMHO an ordering does not make sense if you don't set a partitioning
(what keyBy() basically does) because elements can be in arbitrary
partitions and then the sorting is
Thanks for clarifying.
For map/flatMap it makes sense. However, I would handle filter
differently. Even if the re-grouping can be removed in an optimization
step internally -- which might also be possible for map/flatMap -- I
think it is annoying for the user to specify .byKey() twice...
-> stream
Hi,
I'll try to answer these one at a time:
1) After a map/flatMap the key (and partitioning) of the data is lost,
that's why it goes back to a vanilla DataStream. I think filter also has
this behavior just to fit in with the other ones. Also, any chain of
filter/map/flatMap can also be expressed
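To make the point above concrete, here is a rough sketch of the behavior being
discussed. The keyBy()/KeyedStream names below are assumptions borrowed from a
later shape of the API (the design document calls the type KeyedDataStream):
map() on a keyed stream returns a plain DataStream again, so a second keyBy()
is needed before the next keyed operation.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReKeySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // map() on the keyed stream returns a plain DataStream: the key and
        // the partitioning information are lost at this point.
        DataStream<Tuple2<String, Integer>> mapped = input
                .keyBy(0)
                .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(Tuple2<String, Integer> t) {
                        return Tuple2.of(t.f0, t.f1 * 2);
                    }
                });

        // ... so the next keyed operation needs a second keyBy().
        mapped.keyBy(0).sum(1).print();

        env.execute("re-key sketch");
    }
}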
I just had a look into the "Streams+and+Operations+on+Streams" document.
The initial figure is confusing... (it makes sense after reading the
document but is a stumbling block at the beginning)
A few comments/question:
1) Why is an (Ordered)KeyedDataStream converted into a DataStream if
map/flatMap/filte
Sure, good thing is that the discretization part is quite orthogonal to the
rest. :D
On Thu, 23 Jul 2015 at 10:58 Gyula Fóra wrote:
> I think aside from the Discretization part we reached a consensus. I think
> you can start with the implementation for the rest.
>
> I will do some updates to the
I think aside from the Discretization part we reached a consensus. I think
you can start with the implementation for the rest.
I will do some updates to the Discretization part, and might even start a
new doc if it gets too long.
Gyula
Aljoscha Krettek wrote (timestamp: 23 Jul 2015, Thu, 10:
What's the status of the discussion? What are the opinions on the reworking
of the Streaming API as presented here:
https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
If we could reach a consensus I would like to start working on this to have
it done before the nex
@Matthias:
I think using the KeyedDataStream will simply result in smaller programs.
May be hard for some users to make the connection to a
1-element-tumbling-window, simply because they want to use state. Not
everyone is as deep into that stuff as you are ;-)
On Sun, Jun 28, 2015 at 1:13 AM, Mat
Yes. But as I said, you can get the same behavior with a
GroupedDataStream using a tumbling 1-tuple-size window. Thus, there is
no conceptual advantage in using KeyedDataStream and no disadvantage in
binding stateful operations to GroupedDataStreams.
On 06/27/2015 06:54 PM, Márton Balassi wrote:
>
@Matthias: Your point of working with a minimal number of clear concepts is
desirable to say the least. :)
The reasoning behind the KeyedDataStream is to associate Flink-persisted
operator state with the keys of the data that produced it, so that stateful
computation becomes scalable in the future.
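As a rough sketch of what that enables (the ValueState/ValueStateDescriptor
handles below are illustrative assumptions, not something the design document
fixes): the state handle is scoped to the key of the current element, so it
can be redistributed together with the key partitions.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Illustrative per-key running sum: the ValueState is scoped to the current
// key, which is what makes the state re-partitionable / scalable.
public class KeyedRunningSum
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = sum.value();
        long updated = (current == null ? 0L : current) + value.f1;
        sum.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}

// Usage sketch: input.keyBy(0).flatMap(new KeyedRunningSum()).print();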
This was more a conceptual point-of-view argument. From an
implementation point of view, skipping the window building step is a
good idea if a tumbling 1-tuple-size window is detected.
I prefer to work with a minimum number of concepts (and apply internal
optimization if possible) instead of using
What do you mean by Comment 2? Using the whole window apparatus if you just
want to have, for example, a simple partitioned filter with partitioned
state seems a bit extravagant.
On Sat, 27 Jun 2015 at 15:19 Matthias J. Sax
wrote:
> Nice starting point.
>
> Comment 1:
> "Each individual stream p
Nice starting point.
Comment 1:
"Each individual stream partition delivers elements strictly in order."
(in 'Parallel Streams, Partitions, Time, and Ordering')
I would say "FIFO" and not "strictly in order". If data is not emitted
in-order, the stream partition will not be in-order either.
Comme
I like the bit about the API a lot. What I don't see yet is how delta
windows can work in a distributed way with out-of-order elements.
On Fri, 26 Jun 2015 at 19:43 Stephan Ewen wrote:
> Here is a first bit of what I have been writing down. Will add more over
> the next days:
>
>
> https://cwiki.
Here is a first bit of what I have been writing down. Will add more over
the next days:
https://cwiki.apache.org/confluence/display/FLINK/Stream+Windows
https://cwiki.apache.org/confluence/display/FLINK/Parallel+Streams%2C+Partitions%2C+Time%2C+and+Ordering
On Thu, Jun 25, 2015 at 6:35 PM, Pa
+1 for writing this down
> On 25 Jun 2015, at 18:11, Aljoscha Krettek wrote:
>
> +1 go ahead
>
> On Thu, 25 Jun 2015 at 18:02 Stephan Ewen wrote:
>
>> Hey!
>>
>> This thread covers many different topics. Let's break this up into separate
>> discussions.
>>
>> - Operator State is already driv
+1 go ahead
On Thu, 25 Jun 2015 at 18:02 Stephan Ewen wrote:
> Hey!
>
> This thread covers many different topics. Let's break this up into separate
> discussions.
>
> - Operator State is already driven by Gyula and Paris and happening on the
> above mentioned pull request and the followup discus
Hey!
This thread covers many different topics. Let's break this up into separate
discussions.
- Operator State is already driven by Gyula and Paris and happening on the
above mentioned pull request and the followup discussions.
- For windowing, this discussion has brought some results that we s
Sure. I picked this up. Using the current model for "occurrence time
semantics" does not work.
I elaborated on this in the past many times (but nobody cared). It is
important to make it clear to the user what semantics are supported.
Claiming to support "sliding windows" doesn't mean anything; the
Yes, I am aware of this requirement and it would also be supported in my
proposed model.
The problem is that the "custom timestamp" feature gives the impression
that the elements would be windowed according to a user-timestamp. The
results, however, are wrong because of the assumption about eleme
Hi Aljoscha,
I like that you are pushing in this direction. However, IMHO you
misinterpret the current approach. It does not assume that tuples
arrive in-order; the current approach has no notion of a
pre-defined order (for example, the order in which the events are
created). There is only the
Yes, now this also processes about 3 million elements (Window Size 5 sec, Slide
1 sec) but it still fluctuates a lot between 1 million and 5 million.
Performance is not my main concern, however. My concern is that the current
model assumes that elements arrive in order, which is simply not true.
In your code
I'm very sorry, I had a bug in the InversePreReducer. It should be
fixed now. Can you please run it again?
I also tried to reproduce some of your performance numbers, but I'm
getting less than 1/10th of yours. For example, in the Tumbling
case, Current/Reduce produces only ~10 for me. Do
Hi,
I also ran the tests on top of PR 856 (inverse reducer) now. The results
seem incorrect. When I insert a Thread.sleep(1) in the tuple source, all
the previous tests reported around 3600 tuples (Size 5 sec, Slide 1 sec)
(Theoretically there would be 5000 tuples in 5 seconds but this is due to
ov
Hello,
Aljoscha, can you please try the performance test of Current/Reduce
with the InversePreReducer in PR 856? (If you just call sum, it will
use an InversePreReducer.) It would be an interesting test, because
the inverse function optimization really depends on the stream being
ordered, and I th
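For readers following along, the idea behind the inverse-function optimization,
stripped down to a count-based sliding sum in plain Java (this is only a
sketch, not the actual InversePreReducer from PR 856):

import java.util.ArrayDeque;

// Minimal sketch of inverse-reduce for a sliding count window of size N:
// instead of re-summing the whole window, add the arriving element and
// subtract the evicted one. This is correct only if elements leave the
// window in exactly the order they entered it, which is why the
// optimization depends on the stream (partition) being ordered.
public class SlidingSum {
    private final int windowSize;
    private final ArrayDeque<Long> window = new ArrayDeque<>();
    private long sum;

    public SlidingSum(int windowSize) {
        this.windowSize = windowSize;
    }

    // Adds a value and returns the current window sum.
    public long add(long value) {
        window.addLast(value);
        sum += value;                     // "reduce" step
        if (window.size() > windowSize) {
            sum -= window.removeFirst();  // "inverse reduce" step
        }
        return sum;
    }
}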
Thanks for writing this up and comparing to the current implementation. It's
great to see that your mockup indicates correct/expected behaviour *and* better
performance. :-)
Regarding the results for the current mechanism: does this problem affect all
window operators?
– Ufuk
On 25 Jun 2015,
I think I'll have to elaborate a bit, so I created a proof-of-concept
implementation of my ideas and ran some throughput measurements to
alleviate concerns about performance.
First, though, I want to highlight again why the current approach does not
work with out-of-order elements (which, again, oc
I agree, let's separate these topics from each other so we can reach a
resolution faster.
There is already a state discussion in the thread we started with Paris.
On Wed, Jun 24, 2015 at 9:24 AM Kostas Tzoumas wrote:
> I agree with supporting out-of-order out of the box :-), even if this means
> a ma
I agree with supporting out-of-order out of the box :-), even if this means
a major refactoring. This is the right time to refactor the streaming API
before we pull it out of beta. I think that this is more important than new
features in the streaming API, which can be prioritized once the API is o
Out of order is ubiquitous in the real-world. Typically, what happens is
that businesses will declare a maximum allowable delay for delayed
transactions and will commit to results when that delay is reached.
Transactions that arrive later than this cutoff are collected specially as
corrections whi
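Sketched very roughly (plain Java, purely illustrative, not an existing Flink
operator), such a policy commits a window once event time has passed the window
end plus the declared maximum delay, and routes anything later into a
corrections collection:

import java.util.ArrayList;
import java.util.List;

// Illustrative "maximum allowable delay" handling for a single window; the
// names and structure here are assumptions made for the example.
public class BoundedLatenessWindow {
    private final long windowEnd;  // end of the window (event time, millis)
    private final long maxDelay;   // declared maximum allowable delay (millis)
    private final List<Long> contents = new ArrayList<>();
    private final List<Long> corrections = new ArrayList<>();
    private boolean committed = false;

    public BoundedLatenessWindow(long windowEnd, long maxDelay) {
        this.windowEnd = windowEnd;
        this.maxDelay = maxDelay;
    }

    // Elements that belong to this window: before the cutoff they contribute
    // to the result, after the cutoff they are collected as corrections.
    public void add(long value) {
        if (committed) {
            corrections.add(value);
        } else {
            contents.add(value);
        }
    }

    // Driven by whatever notion of event-time progress the system provides.
    public void advanceTime(long currentEventTime) {
        if (!committed && currentEventTime >= windowEnd + maxDelay) {
            committed = true;
            long result = 0;
            for (long v : contents) {
                result += v;
            }
            System.out.println("committed window " + windowEnd + ": sum = " + result);
        }
    }

    public List<Long> getCorrections() {
        return corrections;
    }
}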
I also don't like big changes but sometimes they are necessary. The reason
why I'm so adamant about out-of-order processing is that out-of-order
elements are not some exception that occurs once in a while; they occur
constantly in a distributed system. For example, in this:
https://gist.github.com/
What I like a lot about Aljoscha's proposed design is that we need no
different code for "system time" vs. "event time". It only differs in where
the timestamps are assigned.
The OOP approach also gives you the semantics of total ordering without
imposing merges on the streams.
On Tue, Jun 23, 20
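One way to read that point, as a plain-Java sketch (an assumption about how the
assignment could be factored out, not anything from the design document): the
window-assignment code stays identical, only the function producing the
timestamp differs between "event time" and "system time".

import java.util.function.ToLongFunction;

public class TimestampAssignmentSketch {

    static final class Event {
        final String key;
        final long eventTime;  // timestamp carried by the element itself
        Event(String key, long eventTime) { this.key = key; this.eventTime = eventTime; }
    }

    // Identical windowing logic, regardless of where the timestamp comes from.
    static long tumblingWindowStart(long timestamp, long windowSizeMillis) {
        return timestamp - (timestamp % windowSizeMillis);
    }

    public static void main(String[] args) {
        Event e = new Event("a", 1435276800000L);

        // "event time": use the timestamp the element carries
        ToLongFunction<Event> eventTime = ev -> ev.eventTime;

        // "system time": assign the timestamp when the element is seen
        ToLongFunction<Event> systemTime = ev -> System.currentTimeMillis();

        System.out.println(tumblingWindowStart(eventTime.applyAsLong(e), 5000L));
        System.out.println(tumblingWindowStart(systemTime.applyAsLong(e), 5000L));
    }
}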
I agree that there should be multiple alternatives the user(!) can
choose from. Partial out-of-order processing works for many/most
aggregates. However, if you consider Event-Pattern-Matching, global
ordering is necessary (even if the performance penalty might be high).
I would also keep "system-t
Hey
I think we should not block PRs unnecessarily if your suggested changes
might touch them at some point.
Also, I still think we should not put everything in the DataStream because
it will be a huge mess.
Also we need to agree on the out of order processing, whether we want it
the way you propo
For the windowing designs, we should also have in mind what requirements we
have on the way we keep/store the elements (in external stores, Flink
managed memory, ...)
On Tue, Jun 23, 2015 at 9:55 AM, Aljoscha Krettek
wrote:
> The reason I posted this now is that we need to think about the API an
The reason I posted this now is that we need to think about the API and
windowing before proceeding with the PRs of Gabor (inverse reduce) and
Gyula (removal of "aggregate" functions on DataStream).
For the windowing, I think that the current model does not work for
out-of-order processing. Theref
Hi Aljoscha,
Thanks for the nice summary, this is a very good initiative.
I added some comments to the respective sections (where I didn't fully
agree :) ).
At some point I think it would be good to have a public hangout session on
this, which could make a more dynamic discussion.
Cheers,
Gyula
Hi,
with people proposing changes to the streaming part I also wanted to throw
my hat into the ring. :D
During the last few months, while I was getting acquainted with the
streaming system, I wrote down some thoughts I had about how things could
be improved. Hopefully, they are in somewhat coheren