Thanks, Erik. I went ahead and created SPARK-25782 for this improvement,
since it is a feature that I and others have looked for in MLlib but that
doesn't seem to exist yet. Also, while searching JIRA for PCA-related
issues, I noticed that someone added grouping support for PCA to the MADlib
project a while back (see MADLIB-947), so there does seem to be demand for it.

thanks!
--Matt


On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <eerla...@redhat.com> wrote:

> Hi Matt!
>
> There are a couple of ways to do this. If you want to submit it for
> inclusion in Spark, you should start by filing a JIRA for it, and then open
> a pull request. Another possibility is to publish it as your own
> third-party library, which I have done for aggregators before.
>
>
> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders <m...@saunders.net> wrote:
>
>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>> use the PCA functions provided by MLlib, but they only work on a full
>> dataset, and I needed to do it on a grouped dataset (like a
>> RelationalGroupedDataset).
>>
>> So I built a little Aggregator that can do that. Here’s an example of how
>> it’s called:
>>
>>     val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>
>>     // For each grouping, compute a PCA matrix/vector
>>     val pcaModels = inputData
>>       .groupBy(keys:_*)
>>       .agg(pcaAggregation.as(pcaOutput))
>>
>> I used the same algorithms under the hood as
>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>> directly on Datasets without converting to RDD first.
>>
>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>> so I’d like to contribute it if it would be a benefit to the larger
>> community.
>>
>> So, is this something worth contributing to MLlib? And if so, what are
>> the next steps to start the process?
>>
>> thanks!
>>
>
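
For readers following the thread: the idea of computing PCA per group through
an aggregation can be illustrated with a minimal sketch. It is not the
PCAAggregator from the message above, whose source is not shown here. It
assumes the usual covariance-based approach: each group keeps a small buffer
(row count, per-column sums, and sums of pairwise products) that can be
updated one row at a time and merged across partitions, and the covariance
matrix is derived from that buffer at the end. The names CovBuffer, CovSketch,
zero, reduce, merge and finish are hypothetical and only mirror the general
shape of an aggregator; the code is plain Scala with no Spark dependency, and
the final eigendecomposition step is left out because it would need a linear
algebra library.

```scala
// Hypothetical sketch, not the PCAAggregator discussed in this thread.
// The buffer holds the sufficient statistics for a covariance matrix.
final case class CovBuffer(
    n: Long,                      // number of rows seen so far
    sum: Vector[Double],          // per-column sums
    gram: Vector[Vector[Double]]  // sums of pairwise products x(i) * x(j)
)

object CovSketch {
  // Empty buffer for d-dimensional input vectors.
  def zero(d: Int): CovBuffer =
    CovBuffer(0L, Vector.fill(d)(0.0), Vector.fill(d, d)(0.0))

  // Fold one input row into the buffer.
  def reduce(b: CovBuffer, x: Vector[Double]): CovBuffer =
    CovBuffer(
      b.n + 1,
      b.sum.zip(x).map { case (s, v) => s + v },
      b.gram.zipWithIndex.map { case (row, i) =>
        row.zipWithIndex.map { case (g, j) => g + x(i) * x(j) }
      }
    )

  // Combine two partial buffers, e.g. from different partitions.
  def merge(a: CovBuffer, b: CovBuffer): CovBuffer =
    CovBuffer(
      a.n + b.n,
      a.sum.zip(b.sum).map { case (p, q) => p + q },
      a.gram.zip(b.gram).map { case (r1, r2) =>
        r1.zip(r2).map { case (p, q) => p + q }
      }
    )

  // Sample covariance: (gram - n * mean * mean^T) / (n - 1).
  def finish(b: CovBuffer): Vector[Vector[Double]] = {
    require(b.n >= 2, "need at least two rows per group")
    val n = b.n.toDouble
    val mean = b.sum.map(_ / n)
    val d = b.sum.length
    Vector.tabulate(d, d) { (i, j) =>
      (b.gram(i)(j) - n * mean(i) * mean(j)) / (n - 1)
    }
  }
}
```

To use it per group, fold each group's rows with
`rows.foldLeft(CovSketch.zero(d))(CovSketch.reduce)` and pass the result to
`CovSketch.finish`. The principal components would then come from an
eigendecomposition or SVD of that matrix. Because `merge` is associative, the
same buffer could in principle back a grouped aggregation, though wiring it
into Spark (encoders, column handling) is outside this sketch.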
