Re: [DISCUSS] Flink ML roadmap

Gábor Hermann Thu, 23 Feb 2017 08:22:57 -0800

@Theodore, thanks for taking lead in the coordination :)

Let's see what we can do, and then decide what should start out as anindependent project, or strictly inside Flink.I agree that something experimental like batch ML on streaming wouldprobably benefit more an independent repo first.


On 2017-02-23 16:56, Theodore Vasiloudis wrote:

Sure having a deadline for March 3rd is fine. I can act as coordinator,
trying to guide the discussion to concrete results.

For committers it's up to their discretion and time if one wants to
participate. I don't think it's necessary to have one, but it would be most
welcome.

@Katherin I would suggest you start a topic on the list about FLINK-1730,
if it takes a lot of development effort from your side it's best to at
least try to gauge the community's interest, and whether there will be
motivation to merge the changes.

Maybe at the end of this we have a FLIP we can submit, that's probably the
way forward if we want to keep this effort within the project. For a new,
highly experimental project like batch ML on streaming I would actually
favor developing on an independent repo, which can later be merged into
main if there is interest.

Regards.
Theodore

On Thu, Feb 23, 2017 at 4:41 PM, Gábor Hermann <m...@gaborhermann.com>
wrote:

Okay, let's just aim for around the end of next week, but we can take more
time to discuss if there's still a lot of ongoing activity. Keep the topic
hot!

Thanks all for the enthusiasm :)



On 2017-02-23 16:17, Stavros Kontopoulos wrote:

@Gabor 3rd March is ok for me. But maybe giving a bit more time to it like
a week may suit more people.
What do you think all?
I will contribute to the doc.

+100 for having a co-ordinator + commiter.

Thank you all for joining the discussion.

Cheers,
Stavros

On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <m...@gaborhermann.com>
wrote:

Okay, I've created a skeleton of the design doc for choosing a direction:

https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc
49h3Ud06MIRhahtJ6dw/edit?usp=sharing

Much of the pros/cons have already been discussed here, so I'll try to
put
there all the arguments mentioned in this thread. Feel free to put there
more :)

@Stavros: I agree we should take action fast. What about collecting our
thoughts in the doc by around Tuesday next week (28. February)? Then
decide
on the direction and design a roadmap by around Friday (3. March)? Is
that
feasible, or should it take more time?

I think it will be necessary to have a shepherd, or even better a
committer, to be involved in at least reviewing and accepting the
roadmap.
It would be best, if a committer coordinated all this.
@Theodore: Would you like to do the coordination?

Regarding the use-cases: I've seen some abstracts of talks at SF Flink
Forward [1] that seem promising. There are companies already using Flink
for ML [2,3,4,5].

[1] http://sf.flink-forward.org/program/sessions/
[2] http://sf.flink-forward.org/kb_sessions/experiences-with-str
eaming-vs-micro-batch-for-online-learning/
[3] http://sf.flink-forward.org/kb_sessions/introducing-flink-te
nsorflow/
[4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-le
arning-on-flink/
[5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learn
ing-scenarios-with-flink/

Cheers,
Gabor



On 2017-02-23 15:19, Katherin Eri wrote:

I have asked already some teams for useful cases, but all of them need

time
to think.
During analysis something will finally arise.
May be we can ask partners of Flink  for cases? Data Artisans got
results
of customers survey: [1], ML better support is wanted, so we could ask
what
exactly is necessary.

[1] http://data-artisans.com/flink-user-survey-2016-part-2/

23 февр. 2017 г. 4:32 PM пользователь "Stavros Kontopoulos" <
st.kontopou...@gmail.com> написал:

+100 for a design doc.

Could we also set a roadmap after some time-boxed investigation
captured
in
that document? We need action.

Looking forward to work on this (whatever that might be) ;) Also are
there
any data supporting one direction or the other from a customer
perspective?
It would help to make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com>
wrote:

Yes, ok.

let's start some design document, and write down there already
mentioned
ideas about: parameter server, about clipper and others. Would be nice
if
we will also map this approaches to cases.
Will work on it collaboratively on each topic, may be finally we will

form

some picture, that could be agreed with committers.

@Gabor, could you please start such shared doc, as you have already

several

ideas proposed?

чт, 23 февр. 2017, 15:06 Gábor Hermann <m...@gaborhermann.com>:

I agree, that it's better to go in one direction first, but I think

online and offline with streaming API can go somewhat parallel later.

We

could set a short-term goal, concentrate initially on one direction,
and
showcase that direction (e.g. in a blogpost). But first, we should
list

the pros/cons in a design doc as a minimum. Then make a decision what
direction to go. Would that be feasible?

On 2017-02-23 12:34, Katherin Eri wrote:

I'm not sure that this is feasible, doing all at the same time could
mean
doing nothing((((

I'm just afraid, that words: we will work on streaming not on

batching,

we

have no commiter's time for this, mean that yes, we started work on

FLINK-1730, but nobody will commit this work in the end, as it

already

was

with this ticket.

23 февр. 2017 г. 14:26 пользователь "Gábor Hermann" <

m...@gaborhermann.com>

написал:

@Theodore: Great to hear you think the "batch on streaming" approach
is

possible! Of course, we need to pay attention all the pitfalls

there,

if we
go that way.

+1 for a design doc!

I would add that it's possible to make efforts in all the three

directions

(i.e. batch, online, batch on streaming) at the same time. Although,
it

might be worth to concentrate on one. E.g. it would not be so useful

to

have the same batch algorithms with both the batch API and streaming

API.
We can decide later.

The design doc could be partitioned to these 3 directions, and we

can

collect there the pros/cons too. What do you think?
Cheers,

Gabor


On 2017-02-23 12:13, Theodore Vasiloudis wrote:

Hello all,

@Gabor, we have discussed the idea of using the streaming API to

write

all

of our ML algorithms with a couple of people offline,

and I think it might be possible and is generally worth a shot. The

approach we would take would be close to Vowpal Wabbit, not
exactly
"online", but rather "fast-batch".

There will be problems popping up again, even for very simple
algos

like

on

line linear regression with SGD [1], but hopefully fixing those

will

be

more aligned with the priorities of the community.

@Katherin, my understanding is that given the limited resources,

there

is

no development effort focused on batch processing right now.

So to summarize, it seems like there are people willing to work on

ML

on

Flink, but nobody is sure how to do it.

There are many directions we could take (batch, online, batch on

streaming), each with its own merits and downsides.

If you want we can start a design doc and move the conversation

there,

come
up with a roadmap and start implementing.

Regards,
Theodore

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.
nabble.com/Understanding-connected-streams-use-without-times
tamps-td10241.html

On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <

m...@gaborhermann.com

wrote:
It's great to see so much activity in this discussion :)

I'll try to add my thoughts.

I think building a developer community (Till's 2. point) can be

slightly

separated from what features we should aim for (1. point) and
showcasing
(3. point). Thanks Till for bringing up the ideas for
restructuring,
I'm

sure we'll find a way to make the development process more

dynamic.
I'll

try to address the rest here.

It's hard to choose directions between streaming and batch ML. As

Theo

has

indicated, not much online ML is used in production, but Flink

concentrates
on streaming, so online ML would be a better fit for Flink.

However,

as

most of you argued, there's definite need for batch ML. But batch

ML
seems

hard to achieve because there are blocking issues with persisting,

iteration paths etc. So it's no good either way.

I propose a seemingly crazy solution: what if we developed batch
algorithms also with the streaming API? The batch API would

clearly

seem

more suitable for ML algorithms, but there a lot of benefits of

this
approach too, so it's clearly worth considering. Flink also has

the

high

level vision of "streaming for everything" that would clearly fit

this
case. What do you all think about this? Do you think this solution

would

be
feasible? I would be happy to make a more elaborate proposal, but

push

my

main ideas here:

1) Simplifying by using one system
It could simplify the work of both the users and the developers.

One

could

execute training once, or could execute it periodically e.g. by

using

windows. Low-latency serving and training could be done in the

same

system.

We could implement incremental algorithms, without any side inputs

for

combining online learning (or predictions) with batch learning. Of

course,

all the logic describing these must be somehow implemented (e.g.
synchronizing predictions with training), but it should be easier

to

do

so

in one system, than by combining e.g. the batch and streaming API.

2) Batch ML with the streaming API is not harder
Despite these benefits, it could seem harder to implement batch
ML

with

the streaming API, but in my opinion it's not. There are more
flexible,
lower-level optimization potentials with the streaming API. Most
distributed ML algorithms use a lower-level model than the batch

API

anyway, so sometimes it feels like forcing the algorithm logic

into

the

training API and tweaking it. Although we could not use the batch

primitives like join, we would have the E.g. in my experience with

implementing a distributed matrix factorization algorithm [1], I

couldn't

do a simple optimization because of the limitations of the
iteration
API

[2]. Even if we pushed all the development effort to make the

batch
API

more suitable for ML there would be things we couldn't do. E.g.

there
are

approaches for updating a model iteratively without locks [3,4]

(i.e.
somewhat asynchronously), and I don't see a clear way to implement

such

algorithms with the batch API.
3) Streaming community (users and devs) benefit

The Flink streaming community in general would also benefit from

this

direction. There are many features needed in the streaming API for

ML

to

work, but this is also true for the batch API. One really

important
is

the

loops API (a.k.a. iterative DataStreams) [5]. There has been a lot

of

effort (mostly from Paris) for making it mature enough [6]. Kate

mentioned

using GPUs, and I'm sure they have uses in streaming generally

[7].

Thus,

by improving the streaming API to allow ML algorithms, the

streaming
API

benefit too (which is important as they have a lot more production

users
than the batch API).
4) Performance can be at least as good

I believe the same performance could be achieved with the

streaming

API

as

with the batch API. Streaming API is much closer to the runtime

than

the

batch API. For corner-cases, with runtime-layer optimizations of

batch
API,

we could find a way to do the same (or similar) optimization for

the

streaming API (see my previous point). Such case could be using

managed

memory (and spilling to disk). There are also benefits by default,
e.g.
we
would have a finer grained fault tolerance with the streaming API.

5) We could keep batch ML API
For the shorter term, we should not throw away all the algorithms
implemented with the batch API. By pushing forward the
development

with

side inputs we could make them usable with streaming API. Then, if
the
library gains some popularity, we could replace the algorithms in

the

batch

API with streaming ones, to avoid the performance costs of e.g.

not

being

able to persist.

6) General tools for implementing ML algorithms

Besides implementing algorithms one by one, we could give more

general

tools for making it easier to implement algorithms. E.g. parameter

server

[8,9]. Theo also mentioned in another thread that TensorFlow has a
similar

model to Flink streaming, we could look into that too. I think

often

when

deploying a production ML system, much more configuration and

tweaking
should be done than e.g. Spark MLlib allows. Why not allow that?

7) Showcasing

Showcasing this could be easier. We could say that we're doing

batch

ML

with a streaming API. That's interesting in its own. IMHO this

integration

is also a more approachable way towards end-to-end ML.


Thanks for reading so far :)

[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pd
f
[4] https://www.usenix.org/system/files/conference/hotos13/hotos
13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+
Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.

pdf

[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf

[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.

com/Using-QueryableState-inside-Flink-jobs-and-
Parameter-Server-implementation-td15880.html

Cheers,
Gabor


--

*Yours faithfully, *

*Kate Eri.*

Re: [DISCUSS] Flink ML roadmap

Reply via email to