How many committers will be active for this component?

Ralph
> On Apr 26, 2021, at 7:17 AM, Avijit Basak <avijit.ba...@gmail.com> wrote:
>
> Hi
>
>         As per previous discussions, I have created a temporary repository
> in GitHub under my personal GitHub Id (avijitbasak). The artifacts have been
> copied from commons-numbers. A preliminary structure has been created for
> the proposed component.
>         Please let me know if we want to proceed with this format. We can copy
> the same to any other team repository if required.
>
> Repo URL: https://github.com/avijitbasak/commons-machinelearning
>
> Thanks & Regards
> --Avijit Basak
>
> On Mon, 26 Apr 2021 at 04:49, Paul King <paul.king.as...@gmail.com> wrote:
>
>> On Mon, Apr 26, 2021 at 12:27 AM sebb <seb...@gmail.com> wrote:
>>>
>>> I assume this thread is about the possible ML component.
>>>
>>> If the code was developed by Commons, I assume it could be used as
>>> part of Spark.
>>> However Commons does not currently have many developers who are
>>> familiar with the field.
>>> So it would seem to me better to have development done by a project
>>> which does have relevant experience.
>>>
>>> You say that Spark etc. have lots of jars.
>>> Surely that allows for it to be implemented as a separate jar which
>>> can either be used as part of the Spark platform, or used
>>> independently?
>>
>> The stats I gave were for the current minimal use of those algorithms.
>> Most algorithms are written in Scala, use RDD "dataframes" rather than,
>> say, double arrays, and assume you're running on "the platform", which
>> handles how you might get your data, return results, do logging, etc.,
>> in a potentially concurrent world. Some of those design choices
>> are key to scaling up but don't align with the goal of making the
>> algorithms runnable "independently".
>>
>>> The only other option I see is for Commons to persuade some developers
>>> who are familiar with the field to join Commons to assist with the
>>> algorithms.
>>
>> I agree that is the crux of the issue here. The "commons doesn't have
>> the bandwidth to absorb another algorithm" part of the discussion
>> seems perfectly legit to me. The "and there is an obvious home
>> elsewhere" part of the discussion seemed a little more dubious to me,
>> though obviously that is something which should be considered.
>>
>>> Existing Commons developers can help manage the logistics of packaging
>>> and releasing the code, as this does not require in-depth knowledge of
>>> the design.
>>> However, this only makes sense if the developers skilled in the area are
>>> prepared to assist long-term.
>>>
>>>
>>> On Sat, 24 Apr 2021 at 23:32, Paul King <paul.king.as...@gmail.com> wrote:
>>>>
>>>> Thanks Gilles,
>>>>
>>>> I can provide the same sort of stats for a clustering example
>>>> across commons-math (KMeans) vs Apache Ignite, Apache Spark and
>>>> Rheem/Apache Wayang (incubating) if anyone would find that useful. It
>>>> would no doubt lead to similar conclusions.
>>>>
>>>> Cheers, Paul.
>>>>
>>>> On Sun, Apr 25, 2021 at 8:15 AM Gilles Sadowski <gillese...@gmail.com> wrote:
>>>>>
>>>>> Hello Paul.
>>>>>
>>>>> On Sat, 24 Apr 2021 at 04:42, Paul King <paul.king.as...@gmail.com> wrote:
>>>>>>
>>>>>> I added some more comments relevant to whether the proposed algorithm
>>>>>> belongs somewhere in the commons "math" area back in the Jira:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/MATH-1563
>>>>>
>>>>> Thanks for a "real" user's testimony.
>>>>>
>>>>> As the ML is still the official forum for such a discussion, I'm quoting
>>>>> part of your post on JIRA:
>>>>> ---CUT---
>>>>> For linear regression, taking just one example dataset, commons-math
>>>>> is a couple of library calls for a single 2M library and solves the
>>>>> problem in 240ms. Both Ignite and Spark involve "firing up the
>>>>> platform" and the code is more complex for simple scenarios. Spark
>>>>> has a 181M footprint across 210 jars and solves the problem in about 20s.
>>>>> Ignite has an 87M footprint across 85 jars and solves the problem in >40s.
>>>>> But I can also find more complex scenarios which need to scale,
>>>>> where Ignite and Spark really come into their own.
>>>>> ---CUT---
>>>>>
>>>>> A similar rationale was behind my developing/using the SOFM
>>>>> functionality in the "o.a.c.m.ml.neuralnet" package: I needed a
>>>>> proof of concept, and taking the "lightweight" path seemed more
>>>>> effective than experimenting with those platforms.
>>>>> Admittedly, at that epoch, there were people around who were
>>>>> maintaining the clustering and GA codes; hence, the prototyping
>>>>> of a machine-learning library didn't look strange to anyone.
>>>>>
>>>>> Regards,
>>>>> Gilles
>>>>>
>>>>>>>> [...]
>
> --
> Avijit Basak
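As a rough illustration of the "couple of library calls" mentioned in the quoted stats, a linear-regression fit with commons-math can be as small as the sketch below. This assumes the SimpleRegression class from commons-math3 (o.a.c.m.stat.regression) and uses a few placeholder data points, not the dataset from the JIRA example:

    import org.apache.commons.math3.stat.regression.SimpleRegression;

    public class LinearRegressionSketch {
        public static void main(String[] args) {
            // Fit y = intercept + slope * x on a handful of placeholder points.
            SimpleRegression regression = new SimpleRegression();
            regression.addData(1.0, 2.1);
            regression.addData(2.0, 3.9);
            regression.addData(3.0, 6.2);
            regression.addData(4.0, 8.1);

            // Fitted coefficients and a prediction for a new x value.
            System.out.println("slope     = " + regression.getSlope());
            System.out.println("intercept = " + regression.getIntercept());
            System.out.println("y(5.0)    = " + regression.predict(5.0));
        }
    }

The equivalent Spark or Ignite code would additionally need to fire up the platform and wrap the data in its own structures, which is where the footprint and startup-time differences quoted above come from.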