I think it looks good for a start; we will have to work on the API a little together to make it fit smoothly with what we currently have.
There are a few gaps in the timeline, but you have probably noticed that yourself :) Otherwise +1 from me.

On Wed, Mar 25, 2015 at 11:35 PM, Akshay Dixit <akshayd...@gmail.com> wrote:

Hi,
The link to the draft proposal that I've prepared is https://gist.github.com/akshaydixi/88f3fbcebab0119a6a31
It would be great if I could get some feedback on it.
Regards,
Akshay Dixit

On Wed, Mar 25, 2015 at 2:03 AM, Akshay Dixit <akshayd...@gmail.com> wrote:

Thanks Gyula.

I agree that simple, working implementations are preferable to hacky, complex solutions. I'll start sketching out an initial, straightforward API with only basic pattern matching features and base it on the existing windowing API. I'll post a draft of the proposal tomorrow, keeping the points you've made in mind, so you can look it over and see if it's all right.
Regards,
Akshay Dixit

On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <gyf...@apache.org> wrote:

Hey Dixit,

Sorry for the delay, I had to discuss this in more detail with some of our other core developers.

The consensus seems to be that we would like to push this project in a direction where the changes can be quickly included in the next releases. For this it is essential that we implement features that are complete (and clean) from the user's perspective. This does not necessarily mean that we want everything at once, but rather that it is preferable to start with something clean and simple (for instance the naive chained filter approach) and progressively build more complex logic.

This also means that we would like to keep "researchy" code out of the codebase as much as possible. Of course, once we have a stable API for this functionality we can work towards the optimizations you have mentioned, like operator sharing and so on.

The ideal proposal would give a clear sketch of the pattern matching API that you would like to implement, which might be some operators added to the current API at first and possibly a DSL later with more advanced functionality (this would probably go into a separate library until it is very stable).

So please include in the proposal a preview of what the pattern matching syntax would look like integrated with the current operators, how it would interact with other parts of the system, etc.

These are the things we need to figure out before we consider the optimizations, I think, because it usually turns out that the API semantics you want to provide can hugely affect (and probably limit) the optimization possibilities you have afterwards.

Let me know if you have further questions regarding this :)

Gyula

On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <gyf...@apache.org> wrote:

Hey,
Give me an hour or so as I am currently in a meeting, but I will get back to you afterwards.
Regards,
Gyula

On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <akshayd...@gmail.com> wrote:

Hi,
It'd really help if I got a reply soon. It would be helpful for writing the proposal, since the deadline is on the 27th.
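To make the "naive chained filter approach" Gyula describes above a bit more concrete, here is a minimal, self-contained Java sketch of the per-key state such an approach boils down to when detecting "event A followed by event B". This is only an illustration under my own assumptions; the class and method names (Event, FollowedByMatcher) are invented and are not part of any existing or proposed Flink API.

```java
// Hypothetical sketch only, not an existing or proposed Flink API. It
// illustrates the "naive chained filter" idea: detect the pattern
// "A followed by B" with a per-key flag, the kind of logic one could wrap
// in a stateful map/flatMap over a DataStream.
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class FollowedByMatcher {

    public static final class Event {
        final String key;   // e.g. a sensor or user id
        final String type;  // e.g. "A" or "B"
        Event(String key, String type) { this.key = key; this.type = type; }
    }

    // Per-key flag recording whether the first half of the pattern was seen.
    private final Map<String, Boolean> seenA = new HashMap<>();

    /** Emits a match description when "A followed by B" completes for a key. */
    public Optional<String> onEvent(Event e) {
        if ("A".equals(e.type)) {
            seenA.put(e.key, true);              // remember the "cause"
            return Optional.empty();
        }
        if ("B".equals(e.type) && seenA.getOrDefault(e.key, false)) {
            seenA.remove(e.key);                 // consume the partial match
            return Optional.of("A->B matched for key " + e.key);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        FollowedByMatcher matcher = new FollowedByMatcher();
        Event[] stream = {new Event("k1", "A"), new Event("k2", "B"), new Event("k1", "B")};
        for (Event e : stream) {
            matcher.onEvent(e).ifPresent(System.out::println);
        }
    }
}
```

In a streaming job this logic would live inside a stateful operator rather than a plain HashMap, but the control flow would be essentially the same.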
Thanks.
Regards,
Akshay Dixit

On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <akshayd...@gmail.com> wrote:

Thanks for the explanation, Marton. I've decided to try out for FLINK-1534.

After reading through the thesis [4] and a few other papers [1][2][3], I believe I've gathered a little context to ask more questions, but I'm still not sure how Flink's internals work, so please bear with me. That said, the ongoing effort to document the architecture and internals is really helpful for newbies like me and should greatly reduce the ramp-up time.

Detecting a pattern of events would consist of a pipeline that accepts the pattern query and the source DataStreams, and outputs detected matches of that pattern to a sink or forwards them along to another stream for further computation.

As you said, a simple filter-join-aggregate query system could be implemented using the existing streaming windowing API. But matching over complex events and decoding their pattern queries would require a DSL that transforms queries into an evaluation model. For example, in [1] the authors implement an NFA with a shared versioned buffer that models the queries, while in [4] the authors propose a new, much more expressive language that compiles into a topology graph for Storm.

So in Flink's case, I believe the proposed DSL would generate operator graphs for the Flink compiler to schedule as JobGraphs over TaskManagers. If we don't depend on the windowing API, would we need to create new operators such as the Projection, Conjunction and Union operators defined in [4]?

I would also like to hear your thoughts on how to approach scaling a pattern matching query. Note that all of these techniques talk about scaling a single query. I've read about several approaches:

1. Merging equivalent runs [1]: this seems a good way to squash multiple forked instances of a pattern match into a single one if they share the same state. I'm not sure how we would implement this in Flink, though, since it is a runtime optimization.

2. Implementing a versioned match buffer [1]: this would involve sharing the state of a buffer data structure across multiple candidate match instances of the pattern.

3. Splitting complex composite patterns into simpler sub-patterns [4] and executing separate queries to detect those sub-patterns. This might translate into different tasks, with the source DataStreams duplicated to all of the newly generated tasks.

Also, since I don't know how the Flink compiler behaves, would some of these optimizations involve making changes to it too?
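As a rough illustration of the "merging equivalent runs" optimization from [1] discussed above: each partial match can be seen as a run through an NFA, and runs that end up in the same state for the same key can be collapsed into one. The sketch below is a deliberately simplified, self-contained Java example under my own assumptions; a real engine would also track the buffered events for each run (the shared versioned buffer), which is omitted here.

```java
// Simplified illustration (not the papers' actual implementation) of merging
// equivalent runs: partial matches that reach the same NFA state for the same
// key are deduplicated instead of being tracked separately.
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class RunMerging {

    // A run is identified here only by its NFA state and partition key; a real
    // engine would also carry the events buffered so far for that run.
    static final class Run {
        final String nfaState;
        final String key;

        Run(String nfaState, String key) {
            this.nfaState = nfaState;
            this.key = key;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Run)) {
                return false;
            }
            Run other = (Run) o;
            return nfaState.equals(other.nfaState) && key.equals(other.key);
        }

        @Override
        public int hashCode() {
            return Objects.hash(nfaState, key);
        }
    }

    public static void main(String[] args) {
        Set<Run> activeRuns = new HashSet<>();
        activeRuns.add(new Run("AWAITING_B", "sensor-7"));
        activeRuns.add(new Run("AWAITING_B", "sensor-7")); // equivalent run, merged away
        activeRuns.add(new Run("AWAITING_B", "sensor-9"));
        System.out.println("active runs after merging: " + activeRuns.size()); // prints 2
    }
}
```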
Regards,
Akshay Dixit

[1] Efficient Pattern Matching over Event Streams: http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf
[2] On Supporting Kleene Closure over Event Streams: http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf
[3] Processing Flows of Information: From Data Stream to Complex Event Processing: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf
[4] Distributing Complex Event Detection: http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf

On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:

Dear Akshay,

Thanks again for your interest and for the recent contribution to streaming.

Both of the projects mentioned would be greatly appreciated by the community, and you can also propose other project suggestions here for discussion.

Regarding FLINK-1534, the thesis I mentioned serves as a starting point, and indeed the basic solution can be implemented with filtering and windowing/mapping, with some state storing whether the cause of an event has already been seen. Relying solely on the existing windowing API, however, might cause performance issues if the events also have an expiration timeout; some optimization there would be part of the project. The further challenge is to exploit Flink's parallel job execution to scale a pattern matching query.

Best,

Marton

On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <akshayd...@gmail.com> wrote:

Hi,
I'm Akshay Dixit [1], a fourth-year undergrad at VIT Vellore, India. I'm currently interested in distributed systems and stream processing, am looking to delve deeper into the subject, and hope to gain some insight by contributing to Apache Flink. I've gathered some familiarity with the flink-streaming codebase by recently working on a PR for FLINK-1450 [2].

Both FLINK-1617 [3] and FLINK-1534 [4] are interesting projects that I would love to work on over the summer. I was wondering which of these would be more appreciated by the community, so I can start working towards a proposal for one of them.

Regarding FLINK-1534, I was wondering why simply merging and filtering the existing streams for the events we want to detect would not work. Also, on going through the document mentioned by @mbalassi in the JIRA comment [5], the authors specify some Runtime Event Detection concepts in Section 5.2.
I'm assuming the project entails building something analogous using Flink, and that the deliverables would include working pattern matching operators over Flink DataStreams as described in the report. If so, shouldn't it be trivial to implement the described Binary operator using a WindowedStream and a Filter?
I hope my questions don't seem misplaced here, and I would appreciate links to literature where I can learn more about the topic.

Regards,
Akshay Dixit

[1] http://akshaydixi.me
[2] https://github.com/apache/flink/pull/481
[3] https://issues.apache.org/jira/browse/FLINK-1617
[4] https://issues.apache.org/jira/browse/FLINK-1534
[5] http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf
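To illustrate the basic solution Marton sketches above (filtering plus some state storing whether the cause of an event has already been seen, together with an expiration timeout), here is a small, self-contained Java sketch of the per-key state such an operator might keep. All names are hypothetical; this stands in for what a windowed or stateful operator could do internally and is not claimed to be the project's actual design.

```java
// Sketch (my own assumption) of "state storing whether the cause of an event
// has already been seen", extended with an expiration timeout. Plain Java,
// standing in for the internal state of a windowed/stateful operator.
import java.util.HashMap;
import java.util.Map;

public class CauseWithTimeout {

    private final long timeoutMillis;
    // key -> timestamp of the last "cause" event seen for that key
    private final Map<String, Long> causeSeenAt = new HashMap<>();

    public CauseWithTimeout(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onCause(String key, long ts) {
        causeSeenAt.put(key, ts);
    }

    /** True if an "effect" at time ts follows a non-expired cause for this key. */
    public boolean onEffect(String key, long ts) {
        Long causeTs = causeSeenAt.get(key);
        if (causeTs == null) {
            return false;                      // no cause seen for this key
        }
        if (ts - causeTs > timeoutMillis) {
            causeSeenAt.remove(key);           // cause expired, discard it
            return false;
        }
        causeSeenAt.remove(key);               // consume the match
        return true;
    }

    public static void main(String[] args) {
        CauseWithTimeout op = new CauseWithTimeout(1000);
        op.onCause("door-3", 0);
        System.out.println(op.onEffect("door-3", 500));   // true: within timeout
        op.onCause("door-3", 2000);
        System.out.println(op.onEffect("door-3", 5000));  // false: cause expired
    }
}
```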