Thanks, Erik. I went ahead and created SPARK-25782 for this improvement, since it is a feature that I and others have looked for in MLlib but that doesn't seem to exist yet. Also, while searching JIRA for PCA-related issues, I noticed that someone added grouping support for PCA to the MADlib project a while back (see MADLIB-947), so there does seem to be demand for it.
thanks!
--Matt

On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com> wrote:

> Hi Matt!
>
> There are a couple ways to do this. If you want to submit it for inclusion
> in Spark, you should start by filing a JIRA for it, and then a pull
> request. Another possibility is to publish it as your own 3rd party
> library, which I have done for aggregators before.
>
>
> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net> wrote:
>
>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>> use the PCA functions provided by MLlib, but they only work on a full
>> dataset, and I needed to do it on a grouped dataset (like a
>> RelationalGroupedDataset).
>>
>> So I built a little Aggregator that can do that, here’s an example of how
>> it’s called:
>>
>> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>
>> // For each grouping, compute a PCA matrix/vector
>> val pcaModels = inputData
>>   .groupBy(keys:_*)
>>   .agg(pcaAggregation.as(pcaOutput))
>>
>> I used the same algorithms under the hood as
>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>> directly on Datasets without converting to RDD first.
>>
>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>> so I’d like to contribute it if it would be a benefit to the larger
>> community.
>>
>> So.. is this something worth contributing to MLlib? And if so, what are
>> the next steps to start the process?
>>
>> thanks!
>>
>
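
For anyone following the thread who wants a feel for why PCA fits the aggregator pattern, here is a rough, Spark-free sketch. It is not the PCAAggregator from the message above and it uses no Spark or MLlib API. The names (GroupedPcaSketch, PcaBuffer, pcaByKey) are hypothetical, and the method names zero/reduce/merge/finish are only chosen to mirror the shape of an aggregator. It assumes each group keeps mergeable statistics (row count, column sums, and the sum of outer products), from which a sample covariance matrix can be derived. To stay dependency-free it swaps in plain power iteration, which yields only the first principal component, in place of a full eigendecomposition.

```scala
// Spark-free sketch; all names here are hypothetical, not Spark or MLlib API.
object GroupedPcaSketch {
  // Mergeable statistics for one group: row count, per-column sums,
  // and the sum of outer products x * x^T.
  final case class PcaBuffer(n: Long, sum: Vector[Double], gram: Vector[Vector[Double]])

  // Empty buffer for vectors of length dim.
  def zero(dim: Int): PcaBuffer =
    PcaBuffer(0L, Vector.fill(dim)(0.0), Vector.fill(dim, dim)(0.0))

  // Fold one row into the buffer.
  def reduce(b: PcaBuffer, x: Vector[Double]): PcaBuffer =
    PcaBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, v) => s + v },
      b.gram.zipWithIndex.map { case (row, i) =>
        row.zipWithIndex.map { case (g, j) => g + x(i) * x(j) }
      }
    )

  // Combine two partial buffers, e.g. built from different chunks of one group.
  def merge(a: PcaBuffer, b: PcaBuffer): PcaBuffer =
    PcaBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.gram.zip(b.gram).map { case (r1, r2) => r1.zip(r2).map { case (p, q) => p + q } }
    )

  // Sample covariance: (gram - n * mean * mean^T) / (n - 1).
  def covariance(b: PcaBuffer): Vector[Vector[Double]] = {
    require(b.n >= 2, "need at least two rows per group")
    val mean = b.sum.map(_ / b.n)
    b.gram.zipWithIndex.map { case (row, i) =>
      row.zipWithIndex.map { case (g, j) => (g - b.n * mean(i) * mean(j)) / (b.n - 1) }
    }
  }

  // First principal component via power iteration (a simplification).
  // Caveat: this stalls if the start vector is orthogonal to the top component.
  def finish(b: PcaBuffer, iterations: Int = 200): Vector[Double] = {
    val cov = covariance(b)
    var v = Vector.tabulate(cov.length)(i => 1.0 / (i + 1))
    for (_ <- 0 until iterations) {
      val w = cov.map(row => row.zip(v).map { case (c, x) => c * x }.sum)
      val norm = math.sqrt(w.map(x => x * x).sum)
      if (norm > 0) v = w.map(_ / norm)
    }
    v
  }

  // Local stand-in for groupBy(...).agg(...): one result per key.
  def pcaByKey[K](rows: Seq[(K, Vector[Double])], dim: Int): Map[K, Vector[Double]] =
    rows.groupBy(_._1).map { case (k, group) =>
      k -> finish(group.map(_._2).foldLeft(zero(dim))(reduce))
    }
}
```

Calling GroupedPcaSketch.pcaByKey(rows, dim) on a sequence of (key, vector) pairs returns one unit-length direction per key. The point of the design is that the buffer is small (on the order of dim squared) and merge is a plain element-wise sum, so partial results can be combined in any order. That property is what lets a per-group computation run as an aggregation rather than requiring each group to be collected first.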