Re: [MLlib] PCA Aggregator

2018-10-19 Thread Sean Owen
I think this is great info and context to put in the JIRA. On Fri, Oct 19, 2018, 6:53 PM Matt Saunders wrote: > Hi Sean, thanks for your feedback. I saw this as a missing feature in the > existing PCA implementation in MLlib. I suspect the use case is a common > one: you have data from different

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Matt Saunders
Hi Sean, thanks for your feedback. I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (could be different users, different locations, or different products, for example) and you need to model the

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Sean Owen
It's OK to open a JIRA, though I generally doubt any new functionality will be added. This might be viewed as a small, worthwhile enhancement; I haven't looked at it. It's always more compelling if you can sketch the use case for it and why it is more meaningful in Spark than outside it. There is spar

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Matt Saunders
Thanks, Erik. I went ahead and created SPARK-25782 for this improvement since it is a feature that I and others have looked for in MLlib but that doesn't seem to exist yet. Also, while searching for PCA-related issues in JIRA I noticed that someone added grouping support for PCA to the MADlib project a whil

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
For 3rd-party libs, I have been publishing independently, for example at isarn-sketches-spark or silex: https://github.com/isarn/isarn-sketches-spark https://github.com/radanalyticsio/silex Either of these repos provides some good working examples of publishing a Spark UDAF or ML library for JVM an

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Stephen Boesch
Erik - is there a current location for approved/recommended third-party additions? spark-packages seems to have been stale for years. On Fri, Oct 19, 2018 at 07:06, Erik Erlandson < eerla...@redhat.com> wrote: > Hi Matt! > > There are a couple ways to do this. If you want to submit it for

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
Hi Matt! There are a couple of ways to do this. If you want to submit it for inclusion in Spark, you should start by filing a JIRA for it, and then a pull request. Another possibility is to publish it as your own 3rd-party library, which I have done for aggregators before. On Wed, Oct 17, 2018 at

[MLlib] PCA Aggregator

2018-10-17 Thread Matt Saunders
I built an Aggregator that computes PCA on grouped datasets. I wanted to use the PCA functions provided by MLlib, but they only work on a full dataset, and I needed to do it on a grouped dataset (like a RelationalGroupedDataset). So I built a little Aggregator that can do that, here’s an example o
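
For readers following the thread, here is a minimal sketch of how a grouped PCA aggregator can work in general. It is not Matt's code and does not use the MLlib API. It is plain Python that only mirrors the reduce/merge/finish shape of a Spark Aggregator. The names `PcaBuffer`, `first_component` and `pca_by_group` are hypothetical, and it assumes each group has at least two rows and only the first principal component is wanted.

```python
import math
from collections import defaultdict


class PcaBuffer:
    """Mergeable statistics for one group: count, sum, sum of outer products."""

    def __init__(self, dim):
        self.n = 0
        self.s = [0.0] * dim
        self.ss = [[0.0] * dim for _ in range(dim)]

    def reduce(self, x):
        # Fold one row into the buffer.
        self.n += 1
        for i, xi in enumerate(x):
            self.s[i] += xi
            for j, xj in enumerate(x):
                self.ss[i][j] += xi * xj
        return self

    def merge(self, other):
        # Combine two partial buffers (e.g. from different partitions).
        self.n += other.n
        for i in range(len(self.s)):
            self.s[i] += other.s[i]
            for j in range(len(self.s)):
                self.ss[i][j] += other.ss[i][j]
        return self

    def covariance(self):
        # Sample covariance from the accumulated sums; assumes n >= 2.
        d = len(self.s)
        mean = [v / self.n for v in self.s]
        return [[(self.ss[i][j] - self.n * mean[i] * mean[j]) / (self.n - 1)
                 for j in range(d)] for i in range(d)]


def first_component(cov, iters=100):
    # Power iteration for the dominant eigenvector of the covariance matrix.
    # Assumption: the start vector is not orthogonal to that eigenvector.
    d = len(cov)
    v = [1.0 + i for i in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # zero covariance: keep the current vector
        v = [c / norm for c in w]
    # Fix the sign so the largest-magnitude entry is positive.
    if max(v, key=abs) < 0:
        v = [-c for c in v]
    return v


def pca_by_group(partitions, dim):
    # partitions: iterable of lists of (key, vector) pairs.
    merged = {}
    for part in partitions:
        local = defaultdict(lambda: PcaBuffer(dim))
        for key, x in part:
            local[key].reduce(x)
        for key, buf in local.items():
            if key in merged:
                merged[key].merge(buf)
            else:
                merged[key] = buf
    # "finish" step: turn each group's buffer into its first component.
    return {k: first_component(b.covariance()) for k, b in merged.items()}
```

The point of the design is that the buffer holds only sums, so partial results from different partitions can be merged in any order before the covariance matrix is formed. A production version would more likely delegate the eigendecomposition to a linear algebra library than use power iteration.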