Hi Vasia and Greg,

I totally agree with you. The basic design idea behind Flink and Gelly's API suits my personal taste well.

Marking an API as stable must not be as easy as it looks, and I don't think I am qualified to talk about it now : )

IMO, updating multiple datasets is essential for making Gelly "commonly applicable". (The MST algorithm needs to mark edges during the iteration, and I think there will surely be other algorithms more complicated than that.)

As for the intermediate caching problem, I think users themselves should decide when to cache the results and when to release them (maybe Flink could also detect automatically when a dataset will no longer be accessed and release it).

Graph computing on streams is really attractive, and maybe we should find some use cases first. I am not sure if this paper [1] (and the corresponding project [2]) will help.

Best,
Xingcan

[1] http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf
[2] https://github.com/twitter/GraphJet

On Thu, Mar 2, 2017 at 1:16 AM, Greg Hogan <c...@greghogan.com> wrote:

> Flink’s stable API provides the frameworks (DataStream and DataSet). On
> top of these frameworks Gelly provides additional models for iterative
> algorithms, but there are algorithms such as Minimum Spanning Tree which do
> not easily map to these models (in this instance requiring nested
> iterations; for PageRank it was handling directed graphs; for HITS it was
> processing both in- and out-edges in the same iteration).
>
> One challenge with caching results is when to release the resources.
>
> New algorithms typically require new capabilities, the latter typically
> requiring much more work, so the algorithms are virtually free.
>
> Updating multiple DataSets in an iteration should be another consideration
> for improving the scheduler. Where has this been a limitation?
>
>
> > On Feb 27, 2017, at 8:03 AM, Xingcan Cui <xingc...@gmail.com> wrote:
> >
> > Hi Vasia and Greg,
> >
> > thanks for the discussion. I'd like to share my thoughts.
> >
> > 1) I don't think it's necessary to extend the algorithm list intentionally.
> > It's just like a textbook that cannot cover all the existing algorithms
> > (even if we can). Just representative and commonly used ones will be
> > enough. After all, Gelly is mainly designed for providing a framework
> > rather than an algorithm library. Besides, it seems that Gelly's API is not
> > stable now and thus a huge work of refactoring or even rewriting will arise
> > once the API changes.
> >
> > 2) Unlike other "pure" graph computing frameworks (e.g. Giraph), Gelly is
> > built on top of Flink, which means that it can only use operations
> > provided by it. In my own opinion, Flink's batch processing is not as
> > outstanding as its stream processing. As Greg said, one problem lies in
> > intermediate results caching. Though it's not clear to me (I'm still an
> > ignorant newcomer...) why this feature has not been implemented for such a
> > long time, there must be some reasons. What I see is that, to some extent,
> > it's already obstructed Gelly's development. From this point of view,
> > self-blessing is better than blessing from others and I'm sure some MLers
> > may be more anxious than us :) So, I guess "within Gelly" just means a
> > Gelly-driven development?
> >
> > In a nutshell, I will encourage more concentration on Gelly's API (or even
> > related Flink's API if necessary), rather than high-level things (e.g.
> > algorithms, performance) on top of it. What if we can change both the
> > edges' values and vertices' values during an iteration one day? :)
> >
> > Best,
> > Xingcan
> >
> >
> > On Sat, Feb 25, 2017 at 2:43 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
> >
> >> Hi Greg,
> >>
> >> On 24 February 2017 at 18:09, Greg Hogan <c...@greghogan.com> wrote:
> >>
> >>> Thanks, Vasia, for starting the discussion.
> >>>
> >>> I was expecting more changes from the recent discussion on restructuring
> >>> the project, in particular regarding the libraries.
> >>> Gelly has always collected algorithms and I have personally taken an
> >>> algorithms-first approach for contributions. Is that manageable and
> >>> maintainable? I'd prefer to see no limit to good contributions, and if
> >>> necessary split the codebase or the project.
> >>>
> >>
> >> I don't think there should be a limit either. I do think though that
> >> development should be community-driven, i.e. not making contributions just
> >> for the sake of it, but evaluating their benefit first.
> >> The library already has a quite long list of algorithms. Shall we keep on
> >> extending it? And if yes, how do we choose which algorithms to add? Do we
> >> accept any algorithm even if it hasn't been asked by anyone? So far, we've
> >> added algorithms that we thought were useful and common. But continuing to
> >> extend the library like this doesn't seem maintainable to me, because we
> >> might end up with a lot of code to maintain that nobody uses. On the other
> >> hand, adding more algorithms might attract more users, so I see a trade-off
> >> there.
> >>
> >>
> >>>
> >>> If so, then a secondary goal is to make the algorithms user-accessible and
> >>> easier to review (especially at scale!). FLINK-4949 rewrites
> >>> flink-gelly-examples with modular inputs and algorithms, allows users to
> >>> run all existing algorithms, and makes it trivial to create a driver for
> >>> new algorithms (and when comparing different implementations).
> >>>
> >>
> >> I'm +1 for anything that makes using existing functionality easier.
> >> FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
> >> and/or PR description a bit? I understand the rationale but it would be
> >> nice to have a high-level description of the changes and the new
> >> functionality that the PR adds or the interfaces it modifies.
> >> Otherwise, it will be difficult to review a PR with +5k line changes :)
> >>
> >>
> >>>
> >>> Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
> >>> not possible to review the structure of the open pull requests.
> >>>
> >>
> >>
> >> I'm not sure I understand this point. There was a design document and an
> >> extensive discussion on this issue. Do you think we should revisit? Some
> >> common algorithms for bipartite graphs that I am aware of are SALSA for
> >> recommendations and relevance search for anomaly detection.
> >>
> >>
> >>>
> >>> +1 to evaluating performance and promoting Flink!
> >>>
> >>> Gelly has two shepherds whereas CEP and ML share one committer. New
> >>> algorithms in Gelly require new features in the Batch API (Gelly may also
> >>> start doing streaming, we're cool kids, too)
> >>
> >>
> >> ^^
> >>
> >>
> >>> so we need to find a process
> >>> for snuffing ideas early and for the right balance in dependence on core
> >>> committers' time. For example, reworking the iteration scheduler to allow
> >>> for intermediate outputs and nested iterations. Can this feature be
> >>> developed and reviewed within Gelly? Does it need the blessing of a Stephan
> >>> or Fabian? I'd like to see contributors and committers less dependent on
> >>> the core team and more autonomous.
> >>>
> >>
> >> What do you mean developed and reviewed "within Gelly"?
> >> This feature would require changes in the batch iterations code and will
> >> probably need to be proposed and reviewed as a FLIP, so it would need the
> >> blessing of the community :)
> >>
> >> Having someone who is more familiar with this part of the code help is of
> >> course favorable, but I don't think it's absolutely necessary.
> >>
> >> -V.
> >>
> >>
> >>> Greg
> >>>
> >>> On Fri, Feb 24, 2017 at 10:39 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
> >>>
> >>>> Hello squirrels,
> >>>>
> >>>> this is a discussion thread to organize the Gelly component development for
> >>>> release 1.3 and discuss longer-term plans for the library.
> >>>>
> >>>> I am hoping that with time-based releases, we can distribute the load for
> >>>> PR reviewing and make better use of our time, and also point contributors
> >>>> to "useful" tickets when they offer to help.
> >>>>
> >>>> I'm expecting the outcome of this discussion to be:
> >>>>
> >>>> (1) a set of open PRs to review and try merging for 1.3
> >>>> (2) a set of open JIRAs to work-on before feature freeze
> >>>> (3) a set of JIRAs and PRs to reorganize/close
> >>>> (4) ideas on possible FLIPs
> >>>>
> >>>> Here's my initial take on things, i.e. features *I* see as important in the
> >>>> short-term. Feel free to add/remove/discuss:
> >>>>
> >>>> Release 1.3
> >>>> ==========
> >>>> - Bipartite graph support. Initial support has been added, but there
> >>>> are unreviewed PRs
> >>>> <https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aopen%20bipartite%20>
> >>>> and there is no Scala API yet. It would be nice to organize this feature,
> >>>> decide what functionality we need and what functionality is already covered
> >>>> by the Graph type and have proper bipartite support for 1.3.
> >>>> - Driver improvements, i.e. #3294
> >>>> <https://github.com/apache/flink/pull/3294>
> >>>> - Algorithm improvements, #2733
> >>>> <https://github.com/apache/flink/pull/2733>
> >>>> - Affinity Propagation algorithm. This one has been developed using a bulk
> >>>> iteration plan and needs a review. The PR is #2885
> >>>> <https://github.com/apache/flink/pull/2885>.
> >>>> - Object reuse issues, FLINK-5890, FLINK-5891
> >>>> - Vertex-centric iteration improvement, i.e.
> >>>> FLINK-5127
> >>>>
> >>>>
> >>>> Roadmap
> >>>> ========
> >>>> Regarding longer-term plans, I see the following issues as still being
> >>>> relevant from the existing roadmap [1]:
> >>>> - Extending the iteration functionality to support algorithms, more complex
> >>>> than value-propagation, e.g. with nested loops
> >>>> - Partitioning methods
> >>>> - Partition-centric iterations
> >>>> - Performance evaluation
> >>>>
> >>>> These two lists are by no means complete or final and the goal of this
> >>>> thread is to see what the community is interested in, whether these
> >>>> features / additions make sense to be worked on, or what features are
> >>>> missing.
> >>>> So, please provide your feedback!
> >>>>
> >>>> Cheers,
> >>>> -V.
> >>>>
> >>>> [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly