Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Xingcan Cui Thu, 02 Mar 2017 18:23:23 -0800

Hi Vasia and Greg,

I totally agree with you. The basic design idea behind Flink's and Gelly's
APIs suits my personal taste well.


Marking an API as stable is surely not as easy as it looks, and I don't
think I am qualified to talk about it yet : )

IMO, updating multiple datasets is essential for making Gelly "commonly
applicable". (The MST algorithm needs to mark edges during the iteration, and
I think there are surely other algorithms more complicated than that.)

As for the intermediate caching problem, I think users themselves should
decide when to cache results and when to release them (maybe Flink could
also release a dataset automatically once it detects that the dataset will
not be accessed any more).

Graph computing on streams is really attractive, and maybe we should find
some use cases first. I am not sure whether this paper [1] (and the
corresponding project [2]) will help.

Best,
Xingcan

[1] http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf
[2] https://github.com/twitter/GraphJet

On Thu, Mar 2, 2017 at 1:16 AM, Greg Hogan <c...@greghogan.com> wrote:

> Flink’s stable API provides the frameworks (DataStream and DataSet). On
> top of these frameworks Gelly provides additional models for iterative
> algorithms, but there are algorithms such as Minimum Spanning Tree which do
> not easily map to these models (in this instance requiring nested
> iterations; for PageRank it was handling directed graphs; for HITS it was
> processing both in- and out-edges in the same iteration).
>
> One challenge with caching results is when to release the resources.
>
> New algorithms typically require new capabilities, the latter typically
> requiring much more work, so the algorithms are virtually free.
>
> Updating multiple DataSets in an iteration should be another consideration
> for improving the scheduler. Where has this been a limitation?
>
>
> > On Feb 27, 2017, at 8:03 AM, Xingcan Cui <xingc...@gmail.com> wrote:
> >
> > Hi Vasia and Greg,
> >
> > thanks for the discussion. I'd like to share my thoughts.
> >
> > 1) I don't think it's necessary to extend the algorithm list intentionally.
> > It's just like a textbook that cannot cover all the existing algorithms
> > (even if we could). Representative and commonly used ones will be
> > enough. After all, Gelly is mainly designed to provide a framework
> > rather than an algorithm library. Besides, it seems that Gelly's API is not
> > stable yet, and thus a huge amount of refactoring or even rewriting will
> > arise once the API changes.
> >
> > 2) Unlike other "pure" graph computing frameworks (e.g. Giraph), Gelly is
> > built on top of Flink, which means that it can only use the operations that
> > Flink provides. In my opinion, Flink's batch processing is not as
> > outstanding as its streaming. As Greg said, one problem lies in intermediate
> > result caching. Though it's not clear to me (I'm still an ignorant
> > newcomer...) why this feature has not been implemented for such a long time,
> > there must be some reasons. What I see is that, to some extent, it has
> > already obstructed Gelly's development. From this point of view,
> > self-blessing is better than blessing from others, and I'm sure some MLers
> > may be more anxious than us :) So, I guess "within Gelly" just means
> > Gelly-driven development?
> >
> > In a nutshell, I would encourage more focus on Gelly's API (or even the
> > related Flink API if necessary), rather than on higher-level things (e.g.
> > algorithms, performance) built on top of it. What if one day we could change
> > both the edges' values and the vertices' values during an iteration? :)
> >
> > Best,
> > Xingcan
> >
> >
> > On Sat, Feb 25, 2017 at 2:43 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
> >
> >> Hi Greg,
> >>
> >> On 24 February 2017 at 18:09, Greg Hogan <c...@greghogan.com> wrote:
> >>
> >>> Thanks, Vasia, for starting the discussion.
> >>>
> >>> I was expecting more changes from the recent discussion on restructuring
> >>> the project, in particular regarding the libraries. Gelly has always
> >>> collected algorithms and I have personally taken an algorithms-first
> >>> approach for contributions. Is that manageable and maintainable? I'd prefer
> >>> to see no limit to good contributions, and if necessary split the codebase
> >>> or the project.
> >>>
> >>
> >> I don't think there should be a limit either. I do think though that
> >> development should be community-driven, i.e. not making contributions just
> >> for the sake of it, but evaluating their benefit first.
> >> The library already has quite a long list of algorithms. Shall we keep on
> >> extending it? And if yes, how do we choose which algorithms to add? Do we
> >> accept any algorithm even if nobody has asked for it? So far, we've
> >> added algorithms that we thought were useful and common. But continuing to
> >> extend the library like this doesn't seem maintainable to me, because we
> >> might end up with a lot of code to maintain that nobody uses. On the other
> >> hand, adding more algorithms might attract more users, so I see a trade-off
> >> there.
> >>
> >>
> >>>
> >>> If so, then a secondary goal is to make the algorithms user-accessible and
> >>> easier to review (especially at scale!). FLINK-4949 rewrites
> >>> flink-gelly-examples with modular inputs and algorithms, allows users to
> >>> run all existing algorithms, and makes it trivial to create a driver for
> >>> new algorithms (and when comparing different implementations).
> >>>
> >>
> >> I'm +1 for anything that makes using existing functionality easier.
> >> FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
> >> and/or PR description a bit? I understand the rationale but it would be
> >> nice to have a high-level description of the changes and the new
> >> functionality that the PR adds or the interfaces it modifies. Otherwise, it
> >> will be difficult to review a PR with +5k line changes :)
> >>
> >>
> >>
> >>>
> >>> Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
> >>> not possible to review the structure of the open pull requests.
> >>>
> >>
> >>
> >> I'm not sure I understand this point. There was a design document and an
> >> extensive discussion on this issue. Do you think we should revisit it? Some
> >> common algorithms for bipartite graphs that I am aware of are SALSA for
> >> recommendations and relevance search for anomaly detection.
> >>
> >>
> >>
> >>>
> >>> +1 to evaluating performance and promoting Flink!
> >>>
> >>> Gelly has two shepherds whereas CEP and ML share one committer. New
> >>> algorithms in Gelly require new features in the Batch API (Gelly may also
> >>> start doing streaming, we're cool kids, too)
> >>
> >>
> >> ^^
> >>
> >>
> >>> so we need to find a process
> >>> for snuffing ideas early and for the right balance in dependence on core
> >>> committers' time. For example, reworking the iteration scheduler to allow
> >>> for intermediate outputs and nested iterations. Can this feature be
> >>> developed and reviewed within Gelly?
> >>
> >>> Does it need the blessing of a Stephan
> >>> or Fabian? I'd like to see contributors and committers less dependent on
> >>> the core team and more autonomous.
> >>>
> >>
> >> What do you mean by developed and reviewed "within Gelly"?
> >> This feature would require changes in the batch iterations code and will
> >> probably need to be proposed and reviewed as a FLIP, so it would need the
> >> blessing of the community :)
> >>
> >> Having someone who is more familiar with this part of the code help is of
> >> course favorable, but I don't think it's absolutely necessary.
> >>
> >> -V.
> >>
> >>
> >>> Greg
> >>>
> >>> On Fri, Feb 24, 2017 at 10:39 AM, Vasiliki Kalavri <
> >>> vasilikikala...@gmail.com> wrote:
> >>>
> >>>> Hello squirrels,
> >>>>
> >>>> this is a discussion thread to organize the Gelly component development for
> >>>> release 1.3 and discuss longer-term plans for the library.
> >>>>
> >>>> I am hoping that with time-based releases, we can distribute the load for
> >>>> PR reviewing and make better use of our time, and also point contributors
> >>>> to "useful" tickets when they offer to help.
> >>>>
> >>>> I'm expecting the outcome of this discussion to be:
> >>>>
> >>>> (1) a set of open PRs to review and try merging for 1.3
> >>>> (2) a set of open JIRAs to work-on before feature freeze
> >>>> (3) a set of JIRAs and PRs to reorganize/close
> >>>> (4) ideas on possible FLIPs
> >>>>
> >>>> Here's my initial take on things, i.e. features *I* see as important in the
> >>>> short-term. Feel free to add/remove/discuss:
> >>>>
> >>>> Release 1.3
> >>>> ==========
> >>>> - Bipartite graph support. Initial support has been added, but there
> >>>> are unreviewed PRs
> >>>> <https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aopen%20bipartite%20>
> >>>> and there is no Scala API yet. It would be nice to organize this feature,
> >>>> decide what functionality we need and what functionality is already covered
> >>>> by the Graph type and have proper bipartite support for 1.3.
> >>>> - Driver improvements, i.e. #3294
> >>>> <https://github.com/apache/flink/pull/3294>
> >>>> - Algorithm improvements, #2733
> >>>> <https://github.com/apache/flink/pull/2733>
> >>>> - Affinity Propagation algorithm. This one has been developed using a bulk
> >>>> iteration plan and needs a review. The PR is #2885
> >>>> <https://github.com/apache/flink/pull/2885>.
> >>>> - Object reuse issues, FLINK-5890, FLINK-5891
> >>>> - Vertex-centric iteration improvement, i.e. FLINK-5127
> >>>>
> >>>>
> >>>> Roadmap
> >>>> ========
> >>>> Regarding longer-term plans, I see the following issues as still being
> >>>> relevant from the existing roadmap [1]:
> >>>> - Extending the iteration functionality to support algorithms more complex
> >>>> than value-propagation, e.g. with nested loops
> >>>> - Partitioning methods
> >>>> - Partition-centric iterations
> >>>> - Performance evaluation
> >>>>
> >>>> These two lists are by no means complete or final and the goal of this
> >>>> thread is to see what the community is interested in, whether these
> >>>> features / additions make sense to be worked on, or what features are
> >>>> missing.
> >>>> So, please provide your feedback!
> >>>>
> >>>> Cheers,
> >>>> -V.
> >>>>
> >>>> [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly
> >>>>
> >>>
> >>
>
>
