Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61922261
@shivaram Thanks for your feedback!
> Could we have constructors also with getter, setters?
It would be hard to maintain binary compatibility in the long run. Once we
remove the `@Experimental` tag, every new parameter would require a new
auxiliary constructor. Setters also make it easier to deprecate parameters.
Finally, constructors would make the Scala code differ from the Java code,
which we try to avoid.
Using getters and setters, the Java code `JavaLogisticRegressionSuite` looks
very similar to the Scala code `LogisticRegressionSuite`.
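The chained-setter style described above can be sketched in plain Java; the class and parameter names here are illustrative stand-ins, not the actual spark.ml API:

```java
// Hypothetical sketch of the chained-setter style; not the actual spark.ml API.
class LogisticRegressionStub {
    private double regParam = 0.0;   // regularization parameter, with a default
    private int maxIter = 100;       // maximum number of iterations

    // Each setter returns `this`, so calls chain the same way in Java and Scala.
    LogisticRegressionStub setRegParam(double value) {
        this.regParam = value;
        return this;
    }

    LogisticRegressionStub setMaxIter(int value) {
        this.maxIter = value;
        return this;
    }

    double getRegParam() { return regParam; }
    int getMaxIter() { return maxIter; }
}
```

Adding a parameter later only adds a setter/getter pair rather than another auxiliary constructor, which is what keeps binary compatibility manageable.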
> Also, FWIW I am not very sure about having setter, getters using traits
like HasRegParam etc. This creates a lot of classes as we add more classes each
having 2-3 params ?
I'm okay with either approach. I feel this is a nice feature of Scala. Some
parameters are very common, e.g., `regParam`, `maxIter`, and `featuresCol`,
and it is tedious to copy the same code around. We could also group parameter
traits into something like `IterativeParams` or `RegressionModelParams` later.
However, when I coded this up, I found that Java doesn't resolve the setter's
return type correctly, so I had to override the setters in each concrete
class, which is unfortunate.
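The return-type problem can be illustrated directly in Java: if a shared base type declares the setter, its static return type is the base type, so chaining falls back to it unless each subclass overrides the setter with a covariant return type. A sketch with hypothetical names, not spark.ml code:

```java
// Sketch of the shared-setter return-type problem; names are illustrative.
class HasRegParamStub {
    protected double regParam = 0.0;

    // Declared here, the setter's static return type is the base class...
    HasRegParamStub setRegParam(double value) {
        this.regParam = value;
        return this;
    }
}

class RidgeRegressionStub extends HasRegParamStub {
    private int maxIter = 100;

    // ...so the subclass must override it with a covariant return type to keep
    // `new RidgeRegressionStub().setRegParam(0.1).setMaxIter(50)` compiling.
    @Override
    RidgeRegressionStub setRegParam(double value) {
        super.setRegParam(value);
        return this;
    }

    RidgeRegressionStub setMaxIter(int value) {
        this.maxIter = value;
        return this;
    }

    double getRegParam() { return regParam; }
    int getMaxIter() { return maxIter; }
}
```

Without the override, `setRegParam(0.1).setMaxIter(50)` fails to compile because `setMaxIter` is not defined on the base type, which is why every class ends up repeating the setters.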
> Passing params across modules.
A model parameter like `featuresCol` can be set before training, so we still
need to resolve it after training:
1) If `lr.model.featuresCol` is set, use it.
2) If `lr.model.featuresCol` is not set, use `lr.featuresCol`, or keep the
default if `lr.featuresCol` is not set either.
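That resolution order can be written down as a small helper. This is a sketch with hypothetical names and a made-up default, not the actual implementation:

```java
import java.util.Optional;

// Sketch of the featuresCol resolution order described above; names are hypothetical.
class FeaturesColResolver {
    static final String DEFAULT_FEATURES_COL = "features";

    /**
     * Resolve which column the model reads features from:
     * 1) the model's own setting, if present;
     * 2) otherwise the estimator's setting, if present;
     * 3) otherwise the default.
     */
    static String resolve(Optional<String> modelFeaturesCol,
                          Optional<String> estimatorFeaturesCol) {
        return modelFeaturesCol.orElse(
            estimatorFeaturesCol.orElse(DEFAULT_FEATURES_COL));
    }
}
```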
> In addition to the SchemaRDD based transformer API, it would be good to
have a single Row transformation API too.
I agree. We need to know the schema in order to transform individual
instances, but a row object doesn't carry a reference to its schema. We can
add `schema` as a parameter that is required to transform a row. Btw, we will
keep the `predict` method that works on a single `Vector`.
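A single-row API with an explicit schema could look like the following toy sketch, where a row is just a list of values and the schema maps column names to positions. All names and the scoring logic are hypothetical:

```java
import java.util.List;

// Toy sketch of transforming a single row given its schema; not the spark.ml API.
class RowTransformSketch {
    /** Look up a value in a row by column name, using the schema for the position. */
    static Object get(List<String> schema, List<Object> row, String column) {
        int index = schema.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Column not found: " + column);
        }
        return row.get(index);
    }

    /** A "prediction" that reads the features column; a placeholder for real scoring. */
    static double transform(List<String> schema, List<Object> row, String featuresCol) {
        double[] features = (double[]) get(schema, row, featuresCol);
        double sum = 0.0;
        for (double v : features) sum += v;  // stand-in for the actual model logic
        return sum;
    }
}
```

The point of the sketch is only that the schema must travel alongside the row, since the row itself cannot resolve column names.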
> Other than the API, the biggest worry I have is in terms of memory
management.
We can call `unpersist` inside `StandardScaler` after computing the mean and
the standard deviation. It is not an issue here, but I get your point. We can
add pipeline components that persist/checkpoint input datasets; deciding when
to `unpersist` is always the hard part. I'll think about it.
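The persist-compute-unpersist pattern inside a fitting step could look like this toy sketch. There is no real Spark dependency here; `persist`/`unpersist` are simulated with a flag, and all names are hypothetical:

```java
import java.util.List;

// Toy stand-in for a cached dataset; persist/unpersist just flip a flag.
class CachedDataset {
    final List<Double> values;
    boolean persisted = false;

    CachedDataset(List<Double> values) { this.values = values; }
    void persist() { persisted = true; }
    void unpersist() { persisted = false; }
}

// Sketch of a stage that caches its input, computes summary statistics,
// and releases the cache as soon as the statistics are available.
class StandardScalerSketch {
    double mean;
    double std;

    StandardScalerSketch fit(CachedDataset data) {
        data.persist();  // avoid recomputing the input across the two passes below
        int n = data.values.size();
        double sum = 0.0;
        for (double v : data.values) sum += v;
        mean = sum / n;
        double sq = 0.0;
        for (double v : data.values) sq += (v - mean) * (v - mean);
        std = Math.sqrt(sq / n);
        data.unpersist();  // release the cache once the stats are computed
        return this;
    }
}
```

The stage itself knows exactly when its input is no longer needed, which is why unpersisting inside the stage works here; the harder question is who unpersists datasets shared across stages.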
> Also if you don't mind I'll try to port some of the image processing
pipelines we have to this class structure by forking your branch. I feel
that'll help me figure out if what features are easy to use etc.
That's great! I've found that getting the code to compile is really helpful
for seeing the trade-offs. But please understand that this is a WIP and I may
update the code.