You could fit the k-means pipeline, get the cluster centers, create a Transformer using that info, then create a new PipelineModel including all the original elements and the new Transformer. Does that work? It's not out of the question to expose a new parameter in KMeansModel that lets you also add a column with the cost; I'd review that kind of PR.
On Tue, Jan 12, 2021 at 12:59 PM Artemis User <arte...@dtechspace.com> wrote:

> First some background:
>
> - We want to use the k-means model for anomaly detection against a
>   multi-dimensional dataset. The current k-means implementation in Spark is
>   designed for clustering purpose, not exactly for anomaly detection. Once a
>   model is trained and pipeline is instantiated, the prediction data frame
>   generated from the transform function only associates each data points with
>   individual clusters. To enable anomaly detection, we would need to
>   recalculate distance of each data point to its corresponding or nearest
>   cluster centroid, and compare with a predefined threshold value to
>   determine anomalies (e.g. normal = distance <= threshold, and anomaly =
>   distance > threshold).
> - The anomaly detection procedure (e.g. calculating the distances and
>   compare them with the threshold) occurs outside the ML pipeline (e.g. after
>   invoking the transform method). This causes problems when we try to
>   persist the pipeline model and later retrieve and instantiate and use it in
>   production. We really would like one Estimator to do this whole process,
>   from ingesting data to anomaly detection in a single pipeline, without the
>   extra code at the end (e.g. after pipeline.transform() is called).
>
> Questions:
>
> - We wanted to just make a custom Transformer to append to the end of
>   the Pipeline so to enable anomaly detection for the test dataset, BUT it
>   requires the clusterCenters from the KMeansModel stage. We can’t figure
>   out how to pass this data, which comes from a fitted stage, to a later
>   stage during runtime. Any Ideas?
> - Is there a way add a callback to the KMeansModel to persist the
>   clusterCenters in the dataframe, or in a file? or add a ParamMap to
>   dynamically set this parameter during runtime?
>
> Thanks a lot in advance!
>
> -- ND