@Trevor Grant The machine learning landscape is getting more and more crowded with tools, so here's a question. Some folks are already connecting R to Spark and MapReduce to make R algorithms work at scale (https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler). Given that, what additional value would there be in porting R code to the algorithms/Samsara framework? To me, the MRS efforts and the approach you are proposing are two parallel tracks.

As for the barriers to entry for contributing, I think they are largely due to the complexity of the codebase and a lack of familiarity with Samsara. I'd love to help create some good docs/tutorials on both the algorithms framework and Samsara when and where it makes sense. However, I feel it would be useful to identify the use cases where the algorithms/Samsara approach has clear wins over MRS with Spark, Spark by itself, or Python/scikit-learn. I've found that, in general, people don't really need custom algorithms in data science; they are typically answering a very basic classification or clustering question and can use linear/logistic regression or a variant of k-means.

I'd also like to help dig into some use cases with Samsara and maybe put them in the examples section.

Thoughts?

________________________________
From: Trevor Grant <[email protected]>
Sent: Tuesday, February 7, 2017 8:47 AM
To: [email protected]; [email protected]
Subject: Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

The idea that Andy briefly touched on is that the Algorithm Framework (hopefully) paves the way for R/CRAN-like user contribution. Increased contribution was a goal I had certainly hoped for. I have begun promoting the idea at Meetups. There hasn't been a concerted effort to push the idea; however, it is a tagline / call to action I am planning on pushing at talks and conferences this spring. Thank you for raising the issue on the mailing list as well.

Using the Samsara framework and the "Algorithms" framework, it is hoped that the barrier to entry for new contributors will be very low, and that they can introduce new algorithms or port them from R. Other 'Big Data' machine learning frameworks suffer because they are not easily extensible.

The algorithms framework makes it (more) clear where a new algorithm would go, and in general how it should behave (e.g., this is a Regressor, so it probably goes in the regressor package; it needs a fit method that takes a DrmX and a DrmY, and a predict method that takes a DrmX and returns a DrmY_hat). The algorithms framework also provides a consistent interface across algorithms and puts up "guard rails" to ensure common things are done in an efficient manner (e.g., serializing just the model, not the fitter and additional unneeded things; thank you, Dmitriy).

The Samsara framework makes it easy to 'read' what the person is doing.
This makes it easier to review PRs, encourages community review, and if (hopefully not, but in case it does happen) someone makes a so-called 'drive-by commit', that is, commits an algorithm and is never heard from again, others can easily understand and maintain the algorithm in the person's absence.

There are a number of issues labeled as beginner in JIRA now, especially with respect to the Algorithms package.

It would probably be good to include a lot of this information in a web page, either at https://mahout.apache.org/developers/how-to-contribute.html or on a page that is linked from it.

Which leads me to the last 'piece of the puzzle' I would like to have in place before aggressively advertising this as a "new-contributor friendly" project: migrating the CMS to Jekyll (https://issues.apache.org/jira/browse/MAHOUT-1933). The rationale is that when new algorithms are submitted, the PR will include relevant documentation (as a convention), and that documentation can be corrected / expanded as needed in a more non-committer-friendly manner.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Tue, Feb 7, 2017 at 4:30 AM, Isabel Drost <[email protected]> wrote:

> On Wed, Feb 01, 2017 at 03:32:24PM -0800, Dmitriy Lyubimov wrote:
> > Isabel, if I understand it correctly, you are asking whether it makes
> > sense to add end2end scenarios based on Samsara to current codebase?
>
> Sorry for being fuzzy. The meta question that I'm trying to find an
> answer for is if there's something that can/should be done to increase
> the number of people that potentially could be assimilated and turned
> into committers one day. One specific idea I had on my mind was to make
> the project easier to use for beginners, one idea to get that
> accomplished I had was to focus on end to end implementations of
> popular use cases. (Sorry, fairly meta...)
>
> > The answer is, absolutely. Yes it does for both rather isolated issues
> > (like computing clusters) and end-2-end scenarios.
> >
> > The only problem with end 2 end scenarios is that they are often
> > difficult to demonstrate with batch-oriented computational system
> > only. That's what prediction.io kind of picked on with COO, they
> > included all of data ingestion, computation and real time scoring
> > queries.
> >
> > But yes, there's, absolutely, tons of value in that. Not everything
> > fits quite nicely, and not everything fits end-2-end (just like with
> > R), but some fairly significant pieces do fit to be written on top.
>
> Makes sense.
>
> > > Where do we start? ;)
> >
> > I would start with figuring a problem I want to solve AND I have a
> > budget to do it AND I can legally contribute on behalf of the IP owner.
>
> I guess given the meta explanation above - if increase in contributions
> was a goal one could also think about making potential areas of
> contribution explicit and highlight the value the project brings
> compared to other systems with a specific focus on samsara. That's
> another angle of me asking weird questions here.
>
> > Then we can think of whether it is a good fit (Samsara is mostly
> > limited to tensor based data only, just like Mapreduce DRM was/is).
> > Some things may not have a convenient algebraic formulation.
>
> +1
>
> Isabel
>
> --
> Sorry for any typos: Mail was typed in vim, written in mutt, via ssh
> (most likely involving some kind of mobile connection only.)
>
