Hi,

As per previous discussions, I have created a temporary repository on GitHub under my personal GitHub ID (avijitbasak). The artifacts have been copied from commons-numbers, and a preliminary structure has been created for the proposed component. Please let me know whether we want to proceed with this format. We can copy it to any other team repository if required.

Repo URL: https://github.com/avijitbasak/commons-machinelearning

Thanks & Regards
--Avijit Basak

On Mon, 26 Apr 2021 at 04:49, Paul King <paul.king.as...@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 12:27 AM sebb <seb...@gmail.com> wrote:
> >
> > I assume this thread is about the possible ML component.
> >
> > If the code was developed by Commons, I assume it could be used as
> > part of Spark.
> > However Commons does not currently have many developers who are
> > familiar with the field.
> > So it would seem to me better to have development done by a project
> > which does have relevant experience.
> >
> > You say that Spark etc have lots of jars.
> > Surely that allows for it to be implemented as a separate jar which
> > can either be used as part of the Spark platform, or used
> > independently?
>
> The stats I gave were for the current minimal use of those algorithms.
> Most algorithms are written in Scala, use RDD "dataframes" rather than
> say double arrays, and assume you're running on "the platform" which
> handles how you might get your data and return results and do logging
> etc. in a potentially concurrent world. Some of those design choices
> are key to scaling up but don't align with the goal of making the
> algorithms runnable "independently".
>
> > The only other option I see is for Commons to persuade some developers
> > who are familiar with the field to join Commons to assist with the
> > algorithms.
>
> I agree that is the crux of the issue here. The "commons doesn't have
> the bandwidth to absorb another algorithm" part of the discussion
> seems perfectly legit to me. The "and there is an obvious home
> elsewhere" part of the discussion seemed a little more dubious to me,
> though obviously that is something which should be considered.
>
> > Existing Commons developers can help manage the logistics of packaging
> > and releasing the code, as this does not require in depth knowledge of
> > the design.
> > However this only makes sense if the developers skilled in the area are
> > prepared to assist long-term.
> >
> >
> > On Sat, 24 Apr 2021 at 23:32, Paul King <paul.king.as...@gmail.com> wrote:
> > >
> > > Thanks Gilles,
> > >
> > > I can provide the same sort of stats across a clustering example
> > > across commons-math (KMeans) vs Apache Ignite, Apache Spark and
> > > Rheem/Apache Wayang (incubating) if anyone would find that useful. It
> > > would no doubt lead to similar conclusions.
> > >
> > > Cheers, Paul.
> > >
> > > On Sun, Apr 25, 2021 at 8:15 AM Gilles Sadowski <gillese...@gmail.com> wrote:
> > > >
> > > > Hello Paul.
> > > >
> > > > On Sat, 24 Apr 2021 at 04:42, Paul King <paul.king.as...@gmail.com> wrote:
> > > > >
> > > > > I added some more comments relevant to if the proposed algorithm
> > > > > belongs somewhere in the commons "math" area back in the Jira:
> > > > >
> > > > > https://issues.apache.org/jira/browse/MATH-1563
> > > >
> > > > Thanks for a "real" user's testimony.
> > > >
> > > > As the ML is still the official forum for such a discussion, I'm quoting
> > > > part of your post on JIRA:
> > > > ---CUT---
> > > > For linear regression, taking just one example dataset, commons-math
> > > > is a couple of library calls for a single 2M library and solves the
> > > > problem in 240ms. Both Ignite and Spark involve "firing up the
> > > > platform" and the code is more complex for simple scenarios. Spark has
> > > > a 181M footprint across 210 jars and solves the problem in about 20s.
> > > > Ignite has an 87M footprint across 85 jars and solves the problem in
> > > > > 40s. But I can also find more complex scenarios which need to scale
> > > > where Ignite and Spark really come into their own.
> > > > ---CUT---
> > > >
> > > > A similar rationale was behind my developing/using the SOFM
> > > > functionality in the "o.a.c.m.ml.neuralnet" package: I needed a
> > > > proof of concept, and taking the "lightweight" path seemed more
> > > > effective than experimenting with those platforms.
> > > > Admittedly, at that time, there were people around who were
> > > > maintaining the clustering and GA codes; hence, the prototyping
> > > > of a machine-learning library didn't look strange to anyone.
> > > >
> > > > Regards,
> > > > Gilles
> > > >
> > > > >>> [...]
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > For additional commands, e-mail: dev-h...@commons.apache.org

--
Avijit Basak