Hi Stephan,

An external project would be possible, and we could maybe merge it in the future if it makes sense. Just wanted to point out that in general there is a need, but I understand the priorities and may also try to work on these.
Best,
Stavros

On Thu, May 26, 2016 at 10:00 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi Stavros!
>
> I think what Aljoscha wants to say is that the community is a bit hard pressed reviewing new and complex things right now. There are a lot of threads going on already.
>
> If you want to work on this, why not make your own GitHub project "Approximate algorithms on Apache Flink" or so?
>
> Greetings,
> Stephan
>
> On Wed, May 25, 2016 at 3:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> > Hi,
> > that link was interesting, thanks! As I said though, it's probably not a good fit for Flink right now.
> >
> > The things that I feel are important right now are:
> >
> > - dynamic scaling: the ability of a streaming pipeline to adapt to changes in the amount of incoming data. This is tricky with stateful operations and long-running pipelines. For Spark this is easier to do because every mini-batch is individually scheduled and they can therefore be scheduled on differing numbers of machines.
> >
> > - an API for joining static (or slowly evolving) data with streaming data: this has been coming up in different forms on the mailing lists and when talking with people. Apache Beam solves this with "side inputs". In Flink we want to add something as well, maybe along the lines of side inputs or maybe something more specific for the case of pure joins.
> >
> > - working on managed memory: in Flink we were always very conscious about how memory was used; we were using our own abstractions for dealing with memory and efficient serialization. We call this the "managed memory" abstraction. Spark recently also started going in this direction with Project Tungsten. For the streaming API there are still some places where we could make things more efficient by working on the managed memory more; for example, there is no state backend that uses managed memory.
> > We are either completely on the Java heap or use RocksDB there.
> >
> > - stream SQL: this is obvious and everybody wants it.
> >
> > - a generic cross-runner API: this is what Apache Beam (née Google Dataflow) does. It is very interesting to write a program once and then be able to run it on different runners. This brings more flexibility for users. It's not clear how this will play out in the long run, but it's very interesting to keep an eye on.
> >
> > For most of these the Flink community is in various stages of implementing it, so that's good. :-)
> >
> > Cheers,
> > Aljoscha
> >
> > On Mon, 23 May 2016 at 17:48 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> >
> > > Hey Aljoscha,
> > >
> > > Thanks for the useful comments. I have recently looked at Spark sketches:
> > > http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
> > > So there must be value in this effort. In my experience counting in general is a common need for large data sets. For example, people often use Redis for its HyperLogLog algorithm in a non-streaming setting.
> > >
> > > What are the other areas you find more important or of higher priority for the time being?
> > >
> > > Best,
> > > Stavros
> > >
> > > On Mon, May 23, 2016 at 6:18 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > >
> > > > Hi,
> > > > no such changes are planned right now. The separation between the keys is very strict in order to make the windowing state re-partitionable, so that we can implement dynamic rescaling of the parallelism of a program.
> > > >
> > > > The WindowAll is only used for specific cases where you need a Trigger that sees all elements of the stream. I personally don't think it is very useful because it is not scalable.
> > > > In theory, for time windows this can be parallelized, but it is not currently done in Flink.
> > > >
> > > > Do you have a specific use case for the count-min sketch in mind? If not, maybe our energy is better spent on producing examples with real-world applicability. I'm not against having an example for a count-min sketch, I'm just worried that you might put your energy into something that is not useful to a lot of people.
> > > >
> > > > Cheers,
> > > > Aljoscha
> > > >
> > > > On Fri, 20 May 2016 at 20:13 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> > > >
> > > > > Hi, thanks for the feedback.
> > > > >
> > > > > So there is a limitation due to the parallel windows implementation. Are there no intentions to change that somehow to accommodate similar estimations?
> > > > >
> > > > > Is WindowAll in practice used as a step in the pipeline? I mean, since it is inherently not parallel, it cannot scale, correct? Although there is an exception: "Only for special cases, such as aligned time windows is it possible to perform this operation in parallel." Probably I am missing something...
> > > > >
> > > > > I could try to do the example stuff (and open a new feature on JIRA for that). I will also vote for closing the old issue, since there is no other way, at least for the time being...
> > > > >
> > > > > Thanks,
> > > > > Stavros
> > > > >
> > > > > On Fri, May 20, 2016 at 7:02 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > > > >
> > > > > > Hi,
> > > > > > with how the window API currently works this can only be done for non-parallel windows. For keyed windows everything that happens is scoped to the key of the elements: window contents are kept in per-key state, triggers fire on a per-key basis.
> > > > > > Therefore a count-min sketch cannot be used, because it would require keeping state across keys.
> > > > > >
> > > > > > For non-parallel windows a user could do this:
> > > > > >
> > > > > >     DataStream input = ...
> > > > > >     input
> > > > > >         .windowAll(<some window>)
> > > > > >         .fold(new MySketch(), new MySketchFoldFunction())
> > > > > >
> > > > > > with sketch data types and a fold function that is tailored to the user types. Therefore, I would prefer not to add a special API for this and vote to close https://issues.apache.org/jira/browse/FLINK-2147. I already commented on https://issues.apache.org/jira/browse/FLINK-2144, saying a similar thing.
> > > > > >
> > > > > > What I would welcome very much is to add some well-documented examples to Flink that showcase how some of these operations can be written.
> > > > > >
> > > > > > Cheers,
> > > > > > Aljoscha
> > > > > >
> > > > > > On Thu, 19 May 2016 at 16:38 Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I would like to push forward the work here:
> > > > > > > https://issues.apache.org/jira/browse/FLINK-2147
> > > > > > >
> > > > > > > Can anyone more familiar with the streaming API verify whether this could be a mature task? The intention is to summarize data over a window, like in the case of StreamGroupedFold. Specifically, implement count-min in a window.
> > > > > > >
> > > > > > > Best,
> > > > > > > Stavros
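For reference, the fold pattern Aljoscha describes accumulates one sketch per window. Below is a minimal, framework-free sketch of what the accumulator (the "MySketch" placeholder) could look like in plain Java. All names and parameters here are hypothetical illustrations, not Flink API; a real Flink fold function would wrap `add` in a `FoldFunction` and the class would need to be serializable:

```java
import java.util.Random;

// Minimal count-min sketch: a depth x width matrix of counters with one
// hash seed per row. Point estimates never underestimate the true count.
public class CountMinSketch {
    private final long[][] counts;
    private final int depth;
    private final int width;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42); // fixed seed so independent sketches agree
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Row-specific bucket for an item; floorMod keeps the index non-negative.
    private int bucket(int row, Object item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    // Called once per element, e.g. from the window's fold function.
    public void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(row, item)]++;
        }
    }

    // Point query: take the minimum over the rows; collisions can only
    // inflate counters, so the result is an upper bound on the true count.
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(row, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 256);
        for (int i = 0; i < 100; i++) sketch.add("flink");
        sketch.add("spark");
        System.out.println(sketch.estimate("flink")); // at least 100
        System.out.println(sketch.estimate("spark")); // at least 1
    }
}
```

Since a fold emits the sketch as the window result, downstream operators would query it with `estimate` rather than receive exact per-key counts, which is exactly the trade-off discussed in this thread.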