First some background:

* We want to use the k-means model for anomaly detection against a multi-dimensional dataset. The current k-means implementation in Spark is designed for clustering, not for anomaly detection: once a model is trained and the pipeline is instantiated, the prediction DataFrame produced by the transform function only associates each data point with a cluster. To enable anomaly detection, we have to recalculate the distance from each data point to its assigned (nearest) cluster centroid and compare it with a predefined threshold (normal = distance <= threshold, anomaly = distance > threshold); see the sketch after this list.
* This anomaly detection step (calculating the distances and comparing them with the threshold) happens outside the ML pipeline, i.e. after the transform method is invoked. That causes problems when we try to persist the pipeline model and later retrieve, instantiate, and use it in production. We would really like a single Estimator that covers the whole process, from ingesting data to anomaly detection, in one pipeline, with no extra code after pipeline.transform() is called.
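For concreteness, the extra post-pipeline step we currently run looks roughly like the sketch below. It is only a sketch under assumptions: the fitted PipelineModel's last stage is the KMeansModel, the features column is named "features", and `flagAnomalies`/`threshold` are names of our own, not Spark API.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}

def flagAnomalies(fitted: PipelineModel, data: Dataset[_], threshold: Double): DataFrame = {
  // Dig the cluster centers out of the fitted k-means stage
  // (assumes it is the last stage of the pipeline).
  val centers = fitted.stages.last.asInstanceOf[KMeansModel].clusterCenters

  // Euclidean distance from each point to the centroid of its assigned cluster.
  val distToCentroid = udf { (features: Vector, cluster: Int) =>
    math.sqrt(Vectors.sqdist(features, centers(cluster)))
  }

  fitted.transform(data) // adds the "prediction" (cluster id) column
    .withColumn("distance", distToCentroid(col("features"), col("prediction")))
    .withColumn("anomaly", col("distance") > threshold) // anomaly = distance > threshold
}
```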
Questions:

* We wanted to write a custom Transformer and append it to the end of the Pipeline to enable anomaly detection on the test dataset, BUT it requires the clusterCenters from the KMeansModel stage, and we can't figure out how to pass data that comes from a fitted stage to a later stage at runtime. Any ideas?
* Alternatively, is there a way to add a callback to the KMeansModel that persists the clusterCenters in the DataFrame or in a file, or to add a ParamMap that sets this parameter dynamically at runtime?

Thanks a lot in advance!

-- ND
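P.S. To make the ask concrete, below is roughly the shape of the single custom Estimator we have in mind: its fit() trains KMeans and returns a model that both assigns clusters and flags anomalies, so the fitted centers never have to be handed between stages by hand. This is only a sketch under assumptions: all the KMeansAnomaly* names, the "distance"/"anomaly" columns, and the threshold param are hypothetical, not Spark API, and transformSchema is simplified.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.{DoubleParam, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StructType

// Hypothetical single-stage estimator: fit() trains k-means, and the resulting
// model both assigns clusters and flags anomalies against a threshold.
class KMeansAnomaly(override val uid: String) extends Estimator[KMeansAnomalyModel] {
  def this() = this(Identifiable.randomUID("kmeansAnomaly"))

  val threshold = new DoubleParam(this, "threshold",
    "distance above which a point is flagged as an anomaly")
  def setThreshold(value: Double): this.type = set(threshold, value)

  override def fit(dataset: Dataset[_]): KMeansAnomalyModel = {
    // Assumes the features column is named "features".
    val kmeansModel = new KMeans().setFeaturesCol("features").fit(dataset)
    new KMeansAnomalyModel(uid, kmeansModel, $(threshold))
  }

  override def copy(extra: ParamMap): KMeansAnomaly = defaultCopy(extra)
  // Simplified: real code would append the prediction/distance/anomaly fields.
  override def transformSchema(schema: StructType): StructType = schema
}

class KMeansAnomalyModel(
    override val uid: String,
    val kmeansModel: KMeansModel,
    val threshold: Double)
  extends Model[KMeansAnomalyModel] {

  override def transform(dataset: Dataset[_]): DataFrame = {
    val centers = kmeansModel.clusterCenters
    // Euclidean distance from each point to its assigned centroid.
    val distToCentroid = udf { (features: Vector, cluster: Int) =>
      math.sqrt(Vectors.sqdist(features, centers(cluster)))
    }
    kmeansModel.transform(dataset) // adds the "prediction" column
      .withColumn("distance", distToCentroid(col("features"), col("prediction")))
      .withColumn("anomaly", col("distance") > threshold)
  }

  override def copy(extra: ParamMap): KMeansAnomalyModel =
    new KMeansAnomalyModel(uid, kmeansModel, threshold)
  override def transformSchema(schema: StructType): StructType = schema
}
```

Even with a wrapper like this, persisting the model would still require implementing MLWritable/MLReadable for the custom classes (DefaultParamsWritable alone would not save the nested KMeansModel), which is part of what we were hoping to avoid hand-rolling.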