Hi Matt,
unfortunately I have no code pointer at hand.
I will sketch how to accomplish this via the API; it should at least help
you get started.
1) ETL + vectorization (I assume your feature vector is named "features")
2) You run a clustering algorithm (say KMeans:
https://spark.a
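A library-free sketch of the two steps above, with invented toy data. The tiny Lloyd's-algorithm k-means here only stands in for what Spark ML's `VectorAssembler` + `KMeans` would do on a real DataFrame; the final per-point label corresponds to the "prediction" column a fitted `KMeansModel` adds:

```python
def kmeans(points, k, iters=20):
    # Deterministic init for the sketch: first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment: the per-point cluster id ("prediction" in Spark).
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return centroids, labels

# Made-up data with two obvious groups, near (0, 0) and near (10, 10).
features = [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (10.2, 9.9)]
centroids, labels = kmeans(features, k=2)
```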
Thanks Alessandro and Christoph. I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?
On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
alessandro.solima...
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).
For this method to be useful you need a "human-readable" model; tree-based
models are generally a good choice.
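A minimal sketch of this idea: treat the cluster ids as class labels and fit an interpretable model on them. A one-level decision stump stands in here for the tree-based model (in Spark ML you would train a `DecisionTreeClassifier` on the cluster-id column instead); the data and function name are invented for illustration:

```python
def best_stump(points, cluster_ids):
    """Find the single (feature, threshold) split that best separates clusters."""
    best = None  # (errors, feature_index, threshold, left_label, right_label)
    n_features = len(points[0])
    for f in range(n_features):
        values = sorted({p[f] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [c for p, c in zip(points, cluster_ids) if p[f] <= t]
            right = [c for p, c in zip(points, cluster_ids) if p[f] > t]
            # Majority cluster id on each side of the split.
            l_lab = max(set(left), key=left.count)
            r_lab = max(set(right), key=right.count)
            errors = sum(c != l_lab for c in left) + sum(c != r_lab for c in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best

points = [(0.0, 5.0), (0.5, 4.0), (9.5, 5.0), (10.0, 4.5)]
cluster_ids = [0, 0, 1, 1]  # e.g. the predictions from the clustering step
errors, feature, threshold, l_lab, r_lab = best_stump(points, cluster_ids)
# Yields a readable rule of the form: "feature <f> <= <t> -> cluster A, else B",
# which is exactly the kind of explanation Matt is after.
```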
Hi Matt,
I see. You could use the trained model to predict the cluster id for each
training point. You should then be able to create a dataset pairing your
original input data with the associated cluster id for each data point.
Finally, you can group this dataset by cluster id and aggregate.
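The grouping step described above can be sketched in plain Python with invented data. In Spark the same thing would be a `groupBy` on the prediction column followed by `agg` over the feature columns; here each cluster is summarized by its per-feature mean:

```python
from collections import defaultdict

rows = [(1.0, 2.0), (1.2, 1.8), (8.0, 9.0), (8.2, 9.2)]
cluster_ids = [0, 0, 1, 1]  # one predicted id per row, as from the trained model

# Pair each input row with its cluster id, then group by that id.
groups = defaultdict(list)
for row, cid in zip(rows, cluster_ids):
    groups[cid].append(row)

# Aggregate: per-cluster feature means. Comparing these summaries across
# clusters shows which feature values characterize each one.
means = {cid: [sum(col) / len(members) for col in zip(*members)]
         for cid, members in groups.items()}
```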
I'm using K Means clustering for a project right now, and it's working very
well. However, I'd like to determine which features distinguish each cluster
so I can explain the "reasons" a data point fits into a specific cluster.
Is there a proper way to do this in Spark ML?