Hi Stephan,

An external project would be possible, and we could maybe merge it in the future if it makes sense. Just wanted to point out that in general there is a need, but I understand the priorities and may also try to work on these.
Best,
Stavros

On Thu, May 26, 2016 at 10:00 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi Stavros!
>
> I think what Aljoscha wants to say is that the community is a bit hard pressed reviewing new and complex things right now. There are a lot of threads going on already.
>
> If you want to work on this, why not make your own GitHub project "Approximate algorithms on Apache Flink" or so?
>
> Greetings,
> Stephan
>
> On Wed, May 25, 2016 at 3:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> > Hi,
> > that link was interesting, thanks! As I said though, it's probably not a good fit for Flink right now.
> >
> > The things that I feel are important right now are:
> >
> > - dynamic scaling: the ability of a streaming pipeline to adapt to changes in the amount of incoming data. This is tricky with stateful operations and long-running pipelines. For Spark this is easier to do because every mini-batch is individually scheduled and they can therefore be scheduled on differing numbers of machines.
> >
> > - an API for joining static (or slowly evolving) data with streaming data: this has been coming up in different forms on the mailing lists and when talking with people. Apache Beam solves this with "side inputs". In Flink we want to add something as well, maybe along the lines of side inputs or maybe something more specific for the case of pure joins.
> >
> > - working on managed memory: in Flink we were always very conscious about how memory was used; we were using our own abstractions for dealing with memory and efficient serialization. We call this the "managed memory" abstraction. Spark recently also started going in this direction with Project Tungsten. For the streaming API there are still some places where we could make things more efficient by working on the managed memory more; for example, there is no state backend that uses managed memory.
> > We are either completely on the Java heap or use RocksDB there.
> >
> > - stream SQL: this is obvious and everybody wants it.
> >
> > - a generic cross-runner API: this is what Apache Beam (née Google Dataflow) does. It is very interesting to write a program once and then be able to run it on different runners. This brings more flexibility for users. It's not clear how this will play out in the long run, but it's very interesting to keep an eye on.
> >
> > For most of these the Flink community is in various stages of implementing it, so that's good. :-)
> >
> > Cheers,
> > Aljoscha
> >
> > On Mon, 23 May 2016 at 17:48 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> >
> > > Hey Aljoscha,
> > >
> > > Thanks for the useful comments. I have recently looked at Spark sketches:
> > > http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
> > > So there must be value in this effort. In my experience counting in general is a common need for large data sets. For example, people often use Redis for its HyperLogLog algorithm in a non-streaming setting.
> > >
> > > What are the other areas you find more important or of higher priority for the time being?
> > >
> > > Best,
> > > Stavros
> > >
> > > On Mon, May 23, 2016 at 6:18 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > >
> > > > Hi,
> > > > no such changes are planned right now. The separation between the keys is very strict in order to make the windowing state re-partitionable, so that we can implement dynamic rescaling of the parallelism of a program.
> > > >
> > > > The WindowAll is only used for specific cases where you need a Trigger that sees all elements of the stream. I personally don't think it is very useful because it is not scalable.
> > > > In theory, for time windows this can be parallelized, but it is not currently done in Flink.
> > > >
> > > > Do you have a specific use case for the count-min sketch in mind? If not, maybe our energy is better spent on producing examples with real-world applicability. I'm not against having an example for a count-min sketch, I'm just worried that you might put your energy into something that is not useful to a lot of people.
> > > >
> > > > Cheers,
> > > > Aljoscha
> > > >
> > > > On Fri, 20 May 2016 at 20:13 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> > > >
> > > > > Hi, thanks for the feedback.
> > > > >
> > > > > So there is a limitation due to the parallel windows implementation. Are there no intentions to change that somehow to accommodate similar estimations?
> > > > >
> > > > > Is WindowAll in practice used as a step in the pipeline? I mean, since it is inherently not parallel, it cannot scale, correct? Although there is an exception: "Only for special cases, such as aligned time windows is it possible to perform this operation in parallel." Probably I am missing something...
> > > > >
> > > > > I could try to do the example stuff (and open a new feature on JIRA for that). I will also vote for closing the old issue, since there is no other way, at least for the time being...
> > > > >
> > > > > Thanks,
> > > > > Stavros
> > > > >
> > > > > On Fri, May 20, 2016 at 7:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > > with how the window API currently works this can only be done for non-parallel windows. For keyed windows everything that happens is scoped to the key of the elements: window contents are kept in per-key state, triggers fire on a per-key basis.
> > > > > > Therefore a count-min sketch cannot be used, because it would require keeping state across keys.
> > > > > >
> > > > > > For non-parallel windows a user could do this:
> > > > > >
> > > > > >     DataStream input = ...
> > > > > >     input
> > > > > >         .windowAll(<some window>)
> > > > > >         .fold(new MySketch(), new MySketchFoldFunction())
> > > > > >
> > > > > > with sketch data types and a fold function that is tailored to the user types. Therefore, I would prefer not to add a special API for this and vote to close https://issues.apache.org/jira/browse/FLINK-2147. I already commented on https://issues.apache.org/jira/browse/FLINK-2144, saying a similar thing.
> > > > > >
> > > > > > What I would welcome very much is to add some well-documented examples to Flink that showcase how some of these operations can be written.
> > > > > >
> > > > > > Cheers,
> > > > > > Aljoscha
> > > > > >
> > > > > > On Thu, 19 May 2016 at 16:38 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I would like to push forward the work here:
> > > > > > > https://issues.apache.org/jira/browse/FLINK-2147
> > > > > > >
> > > > > > > Can anyone more familiar with the streaming API verify whether this could be a mature task? The intention is to summarize data over a window, like in the case of StreamGroupedFold. Specifically, implement count-min in a window.
> > > > > > >
> > > > > > > Best,
> > > > > > > Stavros
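For reference, the fold pattern Aljoscha describes accumulates one sketch per window. Below is a minimal, framework-free sketch of what the accumulator (the "MySketch" placeholder) could look like in plain Java. All names and parameters here are hypothetical illustrations, not Flink API; a real Flink fold function would wrap `add` in a `FoldFunction` and the class would need to be serializable:

```java
import java.util.Random;

// Minimal count-min sketch: a depth x width matrix of counters with one
// hash seed per row. Point estimates never underestimate the true count.
public class CountMinSketch {
    private final long[][] counts;
    private final int depth;
    private final int width;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42); // fixed seed so independent sketches agree
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Row-specific bucket for an item; floorMod keeps the index non-negative.
    private int bucket(int row, Object item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    // Called once per element, e.g. from the window's fold function.
    public void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(row, item)]++;
        }
    }

    // Point query: take the minimum over the rows; collisions can only
    // inflate counters, so the result is an upper bound on the true count.
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(row, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 256);
        for (int i = 0; i < 100; i++) sketch.add("flink");
        sketch.add("spark");
        System.out.println(sketch.estimate("flink")); // at least 100
        System.out.println(sketch.estimate("spark")); // at least 1
    }
}
```

Since a fold emits the sketch as the window result, downstream operators would query it with `estimate` rather than receive exact per-key counts, which is exactly the trade-off discussed in this thread.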