Hi Matt! There are a couple of ways to do this. If you want to submit it for inclusion in Spark, you should start by filing a JIRA for it, and then opening a pull request. Another possibility is to publish it as your own third-party library, which I have done for aggregators before.
On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net> wrote:

> I built an Aggregator that computes PCA on grouped datasets. I wanted to
> use the PCA functions provided by MLlib, but they only work on a full
> dataset, and I needed to do it on a grouped dataset (like a
> RelationalGroupedDataset).
>
> So I built a little Aggregator that can do that, here's an example of how
> it's called:
>
>   val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>
>   // For each grouping, compute a PCA matrix/vector
>   val pcaModels = inputData
>     .groupBy(keys:_*)
>     .agg(pcaAggregation.as(pcaOutput))
>
> I used the same algorithms under the hood as
> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this
> works directly on Datasets without converting to RDD first.
>
> I've seen others who wanted this ability (for example on Stack Overflow)
> so I'd like to contribute it if it would be a benefit to the larger
> community.
>
> So.. is this something worth contributing to MLlib? And if so, what are
> the next steps to start the process?
>
> thanks!
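For anyone following the thread who wants to try this pattern before Matt's code lands anywhere, here is a rough sketch of how a grouped-PCA Aggregator could be structured. To be clear, this is not Matt's implementation: the constructor parameters (k components, dim input width), the buffer layout (count, column sums, Gram matrix), and the use of Breeze's eigSym on the sample covariance are my own assumptions for illustration.

import breeze.linalg.{eigSym, DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix, Vector}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Running totals for one group: row count, per-column sums, and the
// full Gram matrix sum(x * x^T), stored as a flat array.
case class PCABuffer(n: Long, sums: Array[Double], gram: Array[Double])

// Hypothetical sketch (not Matt's actual class): aggregates ml Vectors of
// fixed dimension `dim` into the top-k principal components of each group.
class PCAAggregator(k: Int, dim: Int)
    extends Aggregator[Vector, PCABuffer, Matrix] {

  override def zero: PCABuffer =
    PCABuffer(0L, Array.fill(dim)(0.0), Array.fill(dim * dim)(0.0))

  override def reduce(buf: PCABuffer, v: Vector): PCABuffer = {
    val x = v.toArray
    val sums = buf.sums.clone()
    val gram = buf.gram.clone()
    var i = 0
    while (i < dim) {
      sums(i) += x(i)
      var j = 0
      while (j < dim) {
        gram(i * dim + j) += x(i) * x(j)
        j += 1
      }
      i += 1
    }
    PCABuffer(buf.n + 1, sums, gram)
  }

  override def merge(a: PCABuffer, b: PCABuffer): PCABuffer =
    PCABuffer(
      a.n + b.n,
      a.sums.zip(b.sums).map { case (x, y) => x + y },
      a.gram.zip(b.gram).map { case (x, y) => x + y })

  override def finish(buf: PCABuffer): Matrix = {
    val n = buf.n.toDouble
    val mean = BDV(buf.sums) / n
    // Sample covariance: (sum(x x^T) - n * mean mean^T) / (n - 1).
    // Assumes each group has at least two rows.
    val cov = (new BDM(dim, dim, buf.gram) - (mean * mean.t) * n) / (n - 1.0)
    val eig = eigSym(cov)
    // eigSym returns eigenvalues in ascending order; pick the top-k columns.
    val order = eig.eigenvalues.toArray.zipWithIndex.sortBy(-_._1).map(_._2).take(k)
    val pcs = Array.ofDim[Double](dim * k)
    for ((col, outCol) <- order.zipWithIndex; row <- 0 until dim)
      pcs(outCol * dim + row) = eig.eigenvectors(row, col)
    new DenseMatrix(dim, k, pcs) // column-major: each column is a component
  }

  override def bufferEncoder: Encoder[PCABuffer] = Encoders.product[PCABuffer]
  override def outputEncoder: Encoder[Matrix] = Encoders.kryo[Matrix]
}

Matt's snippet uses .toColumn, which suggests a typed Dataset; on Spark 3.x you could instead register the same Aggregator for untyped DataFrame use with functions.udaf, along these lines (column name "features" is again just an example):

import org.apache.spark.sql.functions.{col, udaf}

val groupedPca = udaf(new PCAAggregator(k = 2, dim = 5))
val pcaModels = inputData
  .groupBy(keys: _*)
  .agg(groupedPca(col("features")).as("pca"))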