GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/3099
[WIP][SPARK-3530][MLLIB] pipeline and parameters
### This is a WIP, so please do not spend time on individual lines.
This PR adds the package `org.apache.spark.ml` with pipeline and parameters, as
discussed on the JIRA. This is a joint work of @jkbradley @etrain @shivaram and
many others who helped on the design, also with help from @marmbrus and
@liancheng on the Spark SQL side. The design doc can be found at:
https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
Again, though it compiles, this is a WIP, so please do not spend time on
individual lines.
The new set of APIs, inspired by the MLI project from AMPLab and by
scikit-learn, leverages Spark SQL's schema support and execution plan
optimization. It introduces the following components that help build a
practical pipeline:
1. Transformer, which transforms one dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators
Parameters can be supplied at fit/transform time or embedded in the components
themselves:
1. Param: a strongly-typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
4. OwnParamMap: trait for components with an embedded parameter map
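To make the design above concrete, here is a toy, self-contained sketch of what `Param`, `ParamMap`, and `Params` could look like in plain Scala. The names mirror the PR, but the bodies are illustrative assumptions, not the actual implementation:

```scala
// Toy sketch of a strongly-typed Param with self-contained doc, a ParamMap,
// and a Params trait providing explainParams. Details are illustrative only.
case class Param[T](name: String, doc: String, default: Option[T] = None) {
  // Enables the `param -> value` syntax used at fit/transform time.
  def ->(value: T): (Param[T], T) = (this, value)
  override def toString: String =
    default.map(d => s"$name: $doc (default: $d)").getOrElse(s"$name: $doc")
}

class ParamMap(private val map: Map[Param[_], Any] = Map.empty) {
  def put[T](param: Param[T], value: T): ParamMap =
    new ParamMap(map + (param -> value))
  // Falls back to the param's default when no value was set.
  def apply[T](param: Param[T]): T =
    map.get(param).map(_.asInstanceOf[T])
      .orElse(param.default)
      .getOrElse(throw new NoSuchElementException(param.name))
  // Merging: values in `other` overwrite values in `this`.
  def ++(other: ParamMap): ParamMap = new ParamMap(map ++ other.map)
}

trait Params {
  def params: Seq[Param[_]]
  def explainParams: String = params.mkString("\n")
}
```

The merge semantics of `++` are what make "fit-time params overwrite embedded params" cheap to implement: the component merges its embedded map with the map supplied at fit/transform time.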
For any component that implements `Params`, users can easily check the docs
by calling `explainParams`:
~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~
or check an individual param:
~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~
**Please start with the example code in `LogisticRegressionSuite.scala`,
where I put three examples:**
1. run a simple logistic regression job
~~~
val lr = new LogisticRegression
lr.set(lr.maxIter, 10)
.set(lr.regParam, 1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
.select('label, 'score, 'prediction).collect()
.foreach(println)
~~~
2. run logistic regression with cross-validation and grid search using
areaUnderROC as the metric
~~~
val lr = new LogisticRegression
val cv = new CrossValidator
val eval = new BinaryClassificationEvaluator
val lrParamMaps = new ParamGridBuilder()
.addMulti(lr.regParam, Array(0.1, 100.0))
.addMulti(lr.maxIter, Array(0, 5))
.build()
cv.set(cv.estimator, lr.asInstanceOf[Estimator[_]])
.set(cv.estimatorParamMaps, lrParamMaps)
.set(cv.evaluator, eval)
.set(cv.numFolds, 3)
val bestModel = cv.fit(dataset)
~~~
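For intuition, `addMulti`/`build` presumably expand into the cartesian product of the supplied value arrays, yielding one param map per combination. A toy version of that expansion, keyed by plain strings purely for illustration (not the PR's typed-param API):

```scala
// Toy sketch of grid expansion: each addMulti call multiplies the number of
// candidate maps, so build() returns one map per parameter combination.
class ToyParamGridBuilder {
  private var grid: Seq[Map[String, Any]] = Seq(Map.empty)
  def addMulti[T](name: String, values: Seq[T]): ToyParamGridBuilder = {
    grid = for (m <- grid; v <- values) yield m + (name -> v)
    this
  }
  def build(): Array[Map[String, Any]] = grid.toArray
}
```

With the two `addMulti` calls above (2 values each), this would produce 4 candidate param maps for the cross-validator to evaluate.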
3. run a pipeline consisting of a standard scaler and a logistic regression
component
~~~
val scaler = new StandardScaler
scaler
.set(scaler.inputCol, "features")
.set(scaler.outputCol, "scaledFeatures")
val lr = new LogisticRegression
lr.set(lr.featuresCol, "scaledFeatures")
val pipeline = new Pipeline
val model = pipeline.fit(dataset, pipeline.stages -> Array(scaler, lr))
val predictions = model.transform(dataset)
.select('label, 'score, 'prediction)
.collect()
.foreach(println)
~~~
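The pipeline's fit logic is presumably a pass over the stages: each estimator is fit on the current dataset and replaced by the model it produces, and each transformer (including each fitted model) is applied before moving to the next stage. A toy, Spark-free sketch of that idea, where all types are simplified stand-ins and not the real API:

```scala
// Toy sketch of pipeline fitting over a dataset of named columns.
type ToyDataset = Seq[Map[String, Double]]

trait ToyTransformer { def transform(data: ToyDataset): ToyDataset }
trait ToyEstimator  { def fit(data: ToyDataset): ToyTransformer }

class ToyPipeline(stages: Seq[Any]) {
  // Returns the fitted pipeline model: every stage reduced to a transformer.
  def fit(data: ToyDataset): Seq[ToyTransformer] = {
    var current = data
    stages.map {
      case t: ToyTransformer =>
        current = t.transform(current); t
      case e: ToyEstimator =>
        val model = e.fit(current)
        current = model.transform(current)
        model
    }
  }
}
```

This is also why "models are transformers" matters: once fit, the whole pipeline collapses into a chain of transformers that can be applied to new data.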
**What is missing now and will be added soon:**
1. Runtime schema checks: before we touch the data, we will go through the
schema and make sure column names and types match the input parameters.
2. Java examples.
3. Serialization.
4. Store training parameters in trained models.
**What I'm not very confident with and definitely need feedback:**
The usage of parameters is a little messy. The interface I do want to keep
is the one that supports multi-model training:
~~~
def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Array[Model]
~~~
which leaves room for the algorithm to optimize the training phase. This
is different from scikit-learn's design, where `fit` returns the estimator
itself. But for large-scale datasets, we do want to train in parallel and
return multiple models. Currently there are two places you can specify parameters:
1. using the embedded paramMap
~~~
lr.set(lr.maxIter, 50)
~~~
2. at fit/transform time:
~~~
lr.fit(dataset, lr.maxIter -> 50)
~~~
The latter overwrites the former. I could change the former to a builder pattern:
~~~
lr.setMaxIter(50)
~~~
which may look better. But there may be better solutions.
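For comparison, the builder-pattern alternative would look roughly like this. This is a hypothetical sketch of the setter style, not code from this PR; the field names and defaults are assumptions:

```scala
// Hypothetical builder-style setters: each setter stores the value and
// returns `this` so calls can be chained, as in scikit-learn-like APIs.
class ToyLogisticRegression {
  private var maxIter: Int = 100
  private var regParam: Double = 0.1
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def getMaxIter: Int = maxIter
  def getRegParam: Double = regParam
}
```

The trade-off is discoverability and compile-time checking (one typed setter per param) versus the uniformity of a single `set(param, value)` that also composes naturally with `ParamMap` at fit/transform time.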
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark SPARK-3530
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3099.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3099
----
commit 376db0add8ec05252a1e8ffc2f622f3dfab1b0cd
Author: Xiangrui Meng <[email protected]>
Date: 2014-11-05T01:11:25Z
pipeline and parameters
----