Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-61923441
  
    @manishamde Thanks for your feedback!
    
    > Like @jkbradley, I prefer lr.setMaxIter(50) over lr.set(lr.maxIter, 50).
    
    Me too, and it is implemented in the current version.
    
    > Also, prefer to avoid passing parameters to fit like lr.fit(dataset, 
lr.maxIter -> 50).
    
    First, it is useful when you want to try some parameters interactively. For example, suppose I am using my current settings and want to see how each parameter affects the result in isolation. I can try:
    
    ~~~
    val lr = new LogisticRegression()
      .setMaxIter(20)
      .setRegParam(0.1)
    lr.fit(dataset, lr.maxIter -> 50)
    lr.fit(dataset, lr.regParam -> 1.0) // note that maxIter reverts to the value set on lr (20)
    ~~~
    
    If we only allowed setters, the equivalent code would be:
    
    ~~~
    val previousMaxIter = lr.getMaxIter
    lr.setMaxIter(50)
    lr.fit(dataset)
    lr.setMaxIter(previousMaxIter)
    val previousRegParam = lr.getRegParam
    lr.setRegParam(1.0)
    lr.fit(dataset)
    lr.setRegParam(previousRegParam)
    ~~~
    
    Another reason I want to have parameters specified in `fit` is for 
multi-model training, as described in the design doc.
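
    The multi-model case could look roughly like this (a sketch only: the `ParamMap` constructor and an array-accepting `fit` overload are assumptions taken from the design doc, not final API):

    ~~~
    // Sketch: fit() takes an array of ParamMaps and returns one model per
    // map, which lets the implementation share computation across settings.
    val paramMaps = Array(
      ParamMap(lr.maxIter -> 50, lr.regParam -> 0.01),
      ParamMap(lr.maxIter -> 50, lr.regParam -> 0.1),
      ParamMap(lr.maxIter -> 50, lr.regParam -> 1.0))
    val models = lr.fit(dataset, paramMaps) // one model per ParamMap
    ~~~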
    
    > Constructors with getters and setters as @shivaram pointed will be great. 
The LOC reduction is important and should not be discounted.
    
    Besides the binary-compatibility and Java-interoperability issues I mentioned, constructors with default arguments don't save many characters:
    
    ~~~
    val lr = new LogisticRegression(maxIter = 50, regParam = 0.1)
    
    val lr = new LogisticRegression()
      .setMaxIter(50)
      .setRegParam(0.1)
    ~~~
    
    > Do we plan to provided syntactic sugar such as a predict method when we 
use model to transform a dataset? For me transform fits well with the feature 
engineering stage and predict after the model training has been performed.
    
    I think we should keep methods that operate on normal RDDs and individual 
instances.
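
    Concretely, keeping both entry points might look like this (a sketch; the single-instance `predict` method and its signature are assumptions, not settled API):

    ~~~
    // Sketch: transform for datasets, plus convenience methods kept for
    // individual instances (method names here are assumptions).
    val model = lr.fit(dataset)
    val withPredictions = model.transform(dataset)      // dataset in, dataset out
    val single = model.predict(Vectors.dense(0.5, 1.2)) // one instance in, one label out
    ~~~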
    
    > It will be great to see the corresponding examples in Python. The 
getter/setters would map well to Python properties. Also, it will be nice to do 
an apples-to-apples comparison with the scikit-learn pipeline.
    
    We need to deal with the serialization of objects and parameters; @davies 
is the expert there. I expect the Python API to be very similar to the 
Scala/Java API.
    
    > Finally, how do we plan to programatically answer (developer/user) 
queries about algorithm properties such as multiclass classification support, 
using internal storage format, etc.
    
    This is beyond the scope of this PR. SPARK-3702, which @jkbradley is 
working on, is relevant to your question.

