Hi Gábor Hermann,

The online Parameter Server is a good proposal.
The Parameter Server paper [1] had an early implementation [2], which has
since evolved into MXNet [3].
I have some thoughts about an online PS in Flink:
1.      Should we support a flexible, configurable update strategy?
        For example, within one iteration: computing several times and
updating once, versus updating on every iteration.
2.      Can the implementation be based fully on the DAG, without too many
modifications to the runtime and core?
        -       The parameter server is the Source holding the distributed
parameter data, called PS.
        -       The worker nodes are the rest of the DAG (everything except
the Source), i.e. ML algorithms implemented with the Flink API.
        -       The results of multi-layer computation flow naturally to the
Sink operator.
        -       The Sink can feed back to the Source for the next iteration.
        -       Automatically tune busy operators, increasing/decreasing the
compute resources (the max parallelism) of the running operators.
3.      Automatically detect GPU support provided by Mesos, and use GPUs if
enabled in the configuration.
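To make point 1 concrete, here is a minimal sketch of what a configurable
update strategy might look like (plain Python as a stand-in; the class and
parameter names are hypothetical, not any Flink or PS API):

```python
# Hedged sketch of a configurable update strategy: accumulate local
# gradients and decide when to push them to the parameter server.
class UpdateStrategy:
    """Decides when locally accumulated updates are pushed to the PS."""
    def __init__(self, push_every_n_steps=1):
        # push_every_n_steps=1 -> push on every step ("update every time");
        # push_every_n_steps=3 -> compute 3 times, then update once.
        self.n = push_every_n_steps
        self.step = 0
        self.accumulated = 0.0

    def on_step(self, local_gradient):
        self.step += 1
        self.accumulated += local_gradient
        if self.step % self.n == 0:
            pushed, self.accumulated = self.accumulated, 0.0
            return pushed   # push the aggregate to the PS
        return None         # keep accumulating locally

strategy = UpdateStrategy(push_every_n_steps=3)
pushes = [strategy.on_step(0.1) for _ in range(6)]
# Only every third step triggers a push of the accumulated gradient.
```

Both behaviors described above fall out of a single knob; other strategies
(e.g. time-based or staleness-bounded pushes) could plug into the same hook.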

[1] https://www.cs.cmu.edu/~muli/file/ps.pdf
[2] https://github.com/dmlc/parameter_server
[3] https://github.com/dmlc/mxnet

> On Feb 12, 2017, at 00:54, Gyula Fóra <gyf...@apache.org> wrote:
> 
> Hi Gábor,
> 
> I think the general idea is very nice, but it would be nice to see more
> clearly what benefit this brings from the developer's perspective. Maybe a
> rough API sketch and 1-2 examples.
> 
> I am wondering what sort of consistency guarantees you imagine for such
> operations, or why fault tolerance is even relevant. Are you thinking
> about an asynchronous API, such that querying the state for another key
> might give you a Future that is guaranteed to complete eventually?
> 
> It seems hard to guarantee the timeliness (order) of these operations
> with respect to the updates made to the states, so I wonder whether there
> is any benefit to doing this compared to using the Queryable State
> interface. Is this only a potential performance improvement, or is it
> easier to work with?
> 
> Cheers,
> Gyula
> 
> Gábor Hermann <m...@gaborhermann.com> wrote (on Feb 10, 2017, Fri, 16:01):
> 
>> Hi all,
>> 
>> TL;DR: Is it worth implementing a special QueryableState for querying
>> state from another part of a Flink streaming job, aligned with Flink's
>> fault tolerance mechanism?
>> 
>> I've been thinking about implementing a Parameter Server with/within
>> Flink. A Parameter Server is basically a specialized key-value store
>> optimized for training distributed machine learning models. So not only
>> the training data, but also the model is distributed. Range queries are
>> typical, and usually vectors and matrices are stored as values.
>> 
>> More generally, an integrated key-value store might also be useful in
>> the Streaming API. Although external key-value stores can be used inside
>> UDFs for the same purpose, aligning them with the fault tolerance
>> mechanism of Flink could be hard. What if state distributed by a key (in
>> the current Streaming API) could be queried from another operator? Much
>> like QueryableState, but querying *inside* the Flink job. We could make
>> use of the fact that state has been queried from inside to optimize
>> communication and integrate fault tolerance.
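As an editorial aside, the "querying state inside the job" idea above might
look something like this to a UDF author (a rough Python stand-in; the
KeyedStateView class and its methods are hypothetical, not an existing Flink
interface):

```python
# Hypothetical sketch of querying keyed state held by another operator in
# the same job. A real implementation would route the request to the task
# instance owning the key's partition; a local dict stands in for that here.
from concurrent.futures import Future

class KeyedStateView:
    """Read access to state partitioned elsewhere in the same job."""
    def __init__(self, backing_store):
        self._store = backing_store   # stand-in for the remote partitions

    def get(self, key) -> Future:
        # Asynchronous: the Future completes once the owning task responds.
        f = Future()
        f.set_result(self._store.get(key))  # local stand-in for the RPC
        return f

preferences = KeyedStateView({"userB": [0.2, 0.7]})
fut = preferences.get("userB")
print(fut.result())  # -> [0.2, 0.7]
```

Returning a Future matches the asynchronous, eventually-completing semantics
raised earlier in the thread.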
>> 
>> The question is whether the Flink community would like such a feature,
>> and if so, how to do it?
>> 
>> I could elaborate my ideas if needed, and I'm happy to create a design
>> doc, but before that, I'd like to know what you all think about this.
>> Also, I don't know if I'm missing something, so please correct me. Here
>> are some quick notes regarding the integrated KV-store:
>> 
>> Pros
>> - It could allow easier implementation of more complicated use cases,
>> e.g. updating users' preferences simultaneously based on each other's
>> preferences when events happen between them, such as making a connection,
>> liking each other's posts, or going to the same concert. User preferences
>> are distributed as state; an event about user A liking user B gets
>> sent to A's state and queries the state of B, then updates the state of
>> B. There have been questions on the user mailing list about similar
>> problems [1].
>> - Integration with fault tolerance. User does not have to maintain two
>> systems consistently.
>> - Optimization potential. In the social network example, other users on
>> the same partition as user A may need the state of user B, so we don't
>> have to send user B's state around twice.
>> - Could specialize to a Parameter Server for simple (and efficient)
>> implementation of (possibly online) machine learning. E.g. sparse
>> logistic regression, LDA, matrix factorization for recommendation systems.
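The preference-update use case from the Pros list, reduced to a toy sketch
(plain Python dicts stand in for keyed state; the function and update rule
are purely illustrative):

```python
# Toy illustration of the "simultaneous preference update" use case: on a
# 'like' event, read both users' preference vectors and nudge each toward
# the other. In a real job, each vector would live in keyed state on
# (possibly different) partitions, which is what makes the cross-query hard.
def apply_like(prefs, a, b, rate=0.1):
    pa, pb = prefs[a], prefs[b]
    prefs[a] = [x + rate * (y - x) for x, y in zip(pa, pb)]
    prefs[b] = [y + rate * (x - y) for x, y in zip(pa, pb)]
    return prefs

prefs = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
apply_like(prefs, "A", "B")
# Each vector has moved a fraction `rate` toward the other.
```

The point of the integrated KV-store is exactly the two reads and two writes
above: today they would need an external store or a round trip through the
job graph.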
>> 
>> Cons
>> - Lot of development effort.
>> - "Client-server" architecture goes against the DAG dataflow model.
>> 
>> Two approaches for the implementation in the streaming API:
>> 
>> 1) An easy way to implement this is to use iterations (or the proposed
>> loops API). We can achieve two-way communication by two operators in a
>> loop: a worker (W) and a Parameter Server (PS), see the diagram [2]. (An
>> additional nested loop in the PS could add replication opportunities).
>> Then we would get fault tolerance "for free" by the work of Paris [3].
>> It would also be on top of the Streaming API, with no effect on the
>> runtime.
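The W/PS loop of approach (1) can be simulated with a plain loop to show the
message flow (Python stand-in, not actual Flink iteration code; the learning
problem is a made-up one-dimensional example):

```python
# Minimal simulation of approach (1): a worker (W) and a parameter server
# (PS) exchanging messages around a feedback edge. In Flink these would be
# two operators inside an iteration; here one loop iteration stands for one
# round trip on the data stream.
def run_loop(iterations=3, lr=0.5, target=4.0):
    params = 0.0                  # state held by the PS operator
    for _ in range(iterations):
        # W: pull current params, compute a gradient on its data partition
        grad = params - target    # gradient of 0.5 * (params - target)**2
        # PS: receive the gradient on the feedback edge, apply the update
        params -= lr * grad
    return params

print(run_loop())  # -> 3.5 after three rounds, approaching the target 4.0
```

Fault tolerance for this shape would indeed come from the loop FT work,
since all W↔PS communication stays on the data stream.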
>> 
>> 2) A problem with the loop approach is that coordination between PS
>> nodes and worker nodes can only be done on the data stream. We could not
>> really use e.g. Akka for async coordination. A harder but more flexible
>> way is to use lower-level interfaces of Flink and touch the runtime.
>> Then we would have to take care of fault tolerance too.
>> 
>> (As a side note: in the batch API generalizing delta iterations could be
>> a solution for Parameter Server [4].)
>> 
>> Thanks for any feedback :)
>> 
>> Cheers,
>> Gabor
>> 
>> [1]
>> 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/sharded-state-2-step-operation-td8631.html
>> [2] https://i.imgur.com/GsliUIh.png
>> [3] https://github.com/apache/flink/pull/1668
>> [4]
>> 
>> https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran
>> 
>> 
