First some background:

 * We want to use the k-means model for anomaly detection against a
   multi-dimensional dataset.  The current k-means implementation in
   Spark is designed for clustering purposes, not specifically for
   anomaly detection.  Once a model is trained and a pipeline is
   instantiated, the prediction data frame generated by the transform
   function only associates each data point with a cluster.  To enable
   anomaly detection, we need to recalculate the distance of each data
   point to its assigned (nearest) cluster centroid and compare it
   with a predefined threshold value to determine anomalies
   (e.g. normal = distance <= threshold, anomaly = distance >
   threshold).
 * The anomaly detection procedure (calculating the distances and
   comparing them with the threshold) happens outside the ML pipeline,
   i.e. after invoking the transform method; see the sketch just after
   this list.  This causes problems when we try to persist the
   pipeline model and later retrieve, instantiate, and use it in
   production.  We would really like a single Estimator to cover the
   whole process, from ingesting data to anomaly detection, in one
   pipeline, without the extra code after pipeline.transform() is
   called.
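
Concretely, here is roughly what we run today after transform().  This
is a sketch in Scala; the column names ("features", "prediction",
"distance", "anomaly") and the threshold are our own choices, not
anything mandated by Spark:

    import org.apache.spark.ml.clustering.KMeansModel
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    // Post-transform anomaly flagging, done outside the pipeline.
    def flagAnomalies(model: KMeansModel,
                      predictions: DataFrame,
                      threshold: Double): DataFrame = {
      val centers = model.clusterCenters  // Array[Vector], one per cluster
      // Euclidean distance from each point to its assigned centroid.
      val distToCentroid = udf { (features: Vector, cluster: Int) =>
        math.sqrt(Vectors.sqdist(features, centers(cluster)))
      }
      predictions
        .withColumn("distance",
          distToCentroid(col("features"), col("prediction")))
        .withColumn("anomaly", col("distance") > threshold)
    }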

Questions:

 * We wanted to append a custom Transformer to the end of the Pipeline
   to enable anomaly detection on the test dataset (a rough sketch of
   what we have in mind follows this list), BUT it requires the
   clusterCenters from the KMeansModel stage.  We can’t figure out how
   to pass this data, which comes from a fitted stage, to a later
   stage at runtime.  Any ideas?
 * Is there a way to add a callback to the KMeansModel to persist the
   clusterCenters in the data frame or in a file?  Or to add a
   ParamMap so this parameter can be set dynamically at runtime?
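
For reference, this is the kind of custom Transformer we have in mind.
A sketch in Scala: the class, column names, and param are ours, and it
only works if the fitted clusterCenters can somehow be handed to it,
which is exactly the open question:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.ml.param.{DoubleParam, ParamMap}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{BooleanType, DoubleType,
      StructField, StructType}

    // Flags anomalies given the cluster centers from a fitted
    // KMeansModel; getting those centers into this stage is the problem.
    class AnomalyFlagger(override val uid: String, centers: Array[Vector])
        extends Transformer {

      def this(centers: Array[Vector]) =
        this(Identifiable.randomUID("anomalyFlagger"), centers)

      // Distance threshold separating normal points from anomalies.
      final val threshold = new DoubleParam(this, "threshold",
        "max distance to the assigned centroid for a normal point")
      def setThreshold(value: Double): this.type = set(threshold, value)

      override def transform(dataset: Dataset[_]): DataFrame = {
        val dist = udf { (features: Vector, cluster: Int) =>
          math.sqrt(Vectors.sqdist(features, centers(cluster)))
        }
        dataset.toDF()
          .withColumn("distance", dist(col("features"), col("prediction")))
          .withColumn("anomaly", col("distance") > $(threshold))
      }

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+
          StructField("distance", DoubleType) :+
          StructField("anomaly", BooleanType))

      override def copy(extra: ParamMap): AnomalyFlagger =
        copyValues(new AnomalyFlagger(uid, centers), extra)
    }

As written, we would have to fit the KMeans stage on its own, read
model.clusterCenters, and only then construct the final Pipeline,
which defeats the purpose of persisting one fitted pipeline.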

Thanks a lot in advance!

-- ND
