Spark ML Pipeline inaccessible types

zapletal-martin Wed, 25 Mar 2015 04:02:27 -0700

Hi,



I have started implementing a machine learning pipeline using Spark 1.3.0 
with the new Pipeline API and DataFrames. I have reached the point where my 
training data set is prepared by a sequence of Transformers, but I am 
struggling to actually train a model and use it for predictions.




I am getting a java.lang.NoSuchMethodException for 
org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName(), 
thrown from the checkInputColumn method in the Params trait when using a 
Predictor (LinearRegression in my case, but that should not matter). This 
looks like a bug: the exception is thrown while executing getParam(colName) 
once the require(actualDataType.equals(datatype), ...) requirement is not 
met, so the expected "requirement failed" exception is never thrown and is 
hidden by the unexpected NoSuchMethodException instead. I can raise a bug if 
this really is an issue and I am not using something incorrectly.
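
To illustrate what I think is happening, here is a minimal sketch in plain 
Scala with no Spark dependency. FakePredictor, its featuresCol accessor, and 
MaskedExceptionDemo are hypothetical names of my own. The sketch assumes that 
getParam looks a param up by reflection on a method of that name, and that it 
is called with the column name while the require message is being built. It 
is a simplified model, not the actual Spark source.

```scala
// Hypothetical stand-in for a Predictor with one declared param accessor.
class FakePredictor {
  // The accessor is named after the *param*, not after the column it holds.
  def featuresCol: String = "myFeaturesColumnName"

  // Assumed behaviour: resolve a param via a no-arg method with that name.
  def getParam(paramName: String): String =
    getClass.getMethod(paramName).invoke(this).toString

  def checkInputColumn(colName: String, actual: String, expected: String): Unit =
    // require takes its message by name, so getParam(colName) runs only when
    // the check fails, and here it is called with the column name.
    require(actual == expected,
      s"Input column $colName must be of type $expected but was actually $actual. " +
      s"Column param description: ${getParam(colName)}")
}

object MaskedExceptionDemo {
  // Returns the simple class name of whatever the check throws, or "none".
  def thrownBy(actual: String, expected: String): String =
    try {
      new FakePredictor().checkInputColumn("myFeaturesColumnName", actual, expected)
      "none"
    } catch {
      case e: Exception => e.getClass.getSimpleName
    }

  def main(args: Array[String]): Unit = {
    println(thrownBy("VectorUDT", "VectorUDT")) // types match, nothing thrown
    println(thrownBy("ArrayType", "VectorUDT")) // mismatch, message lookup throws
  }
}
```

In this model the mismatching call reports NoSuchMethodException rather than 
the IllegalArgumentException that require would normally raise, which matches 
what I see.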




The actual problem I am facing, however, is that the Predictor expects the 
features column to have type VectorUDT, as defined in the Predictor class 
(protected def featuresDataType: DataType = new VectorUDT). Since this type 
is private[spark], my Transformer cannot prepare features with this type, 
and using a different type then correctly results in the exception above.




Is there a way to define a custom Pipeline that can use the existing 
Predictors without bypassing the access modifiers or reimplementing 
something, or is the pipelining API not yet expected to be used in this 
way?




Thanks,

Martin
