Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61922261
@shivaram Thanks for your feedback!
> Could we have constructors also with getter, setters?
It would be hard to maintain binary compatibility in the long run. Once we
remove the `@Experimental` tag, every new parameter would require a new
auxiliary constructor. Setters also make it easier to deprecate parameters.
Finally, constructors would make the Scala code differ from the Java code,
which we try to avoid.
Using getters and setters, the Java code `JavaLogisticRegressionSuite` looks
very similar to the Scala code `LogisticRegressionSuite`.
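The chained-setter style described above can be sketched in plain Java; the class and parameter names here are illustrative stand-ins, not the actual spark.ml API:

```java
// Hypothetical sketch of the chained-setter style; not the actual spark.ml API.
class LogisticRegressionStub {
    private double regParam = 0.0;   // regularization parameter, with a default
    private int maxIter = 100;       // maximum number of iterations

    // Each setter returns `this`, so calls chain the same way in Java and Scala.
    LogisticRegressionStub setRegParam(double value) {
        this.regParam = value;
        return this;
    }

    LogisticRegressionStub setMaxIter(int value) {
        this.maxIter = value;
        return this;
    }

    double getRegParam() { return regParam; }
    int getMaxIter() { return maxIter; }
}
```

Adding a parameter later only adds a setter/getter pair rather than another auxiliary constructor, which is what keeps binary compatibility manageable.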
> Also, FWIW I am not very sure about having setter, getters using traits
like HasRegParam etc. This creates a lot of classes as we add more classes each
having 2-3 params ?
I'm okay with either approach. I feel this is a nice feature of Scala. Some
parameters are very common, e.g., `regParam`, `maxIter`, and `featuresCol`,
and it is tedious to copy the same code around. We could also group parameter
traits into something like `IterativeParams` or `RegressionModelParams` later.
However, when I coded this up, I found that Java doesn't resolve the setter's
return type correctly, so I had to override the setters in each concrete
class, which is unfortunate.
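The return-type problem can be illustrated directly in Java: if a shared base type declares the setter, its static return type is the base type, so chaining falls back to it unless each subclass overrides the setter with a covariant return type. A sketch with hypothetical names, not spark.ml code:

```java
// Sketch of the shared-setter return-type problem; names are illustrative.
class HasRegParamStub {
    protected double regParam = 0.0;

    // Declared here, the setter's static return type is the base class...
    HasRegParamStub setRegParam(double value) {
        this.regParam = value;
        return this;
    }
}

class RidgeRegressionStub extends HasRegParamStub {
    private int maxIter = 100;

    // ...so the subclass must override it with a covariant return type to keep
    // `new RidgeRegressionStub().setRegParam(0.1).setMaxIter(50)` compiling.
    @Override
    RidgeRegressionStub setRegParam(double value) {
        super.setRegParam(value);
        return this;
    }

    RidgeRegressionStub setMaxIter(int value) {
        this.maxIter = value;
        return this;
    }

    double getRegParam() { return regParam; }
    int getMaxIter() { return maxIter; }
}
```

Without the override, `setRegParam(0.1).setMaxIter(50)` fails to compile because `setMaxIter` is not defined on the base type, which is why every class ends up repeating the setters.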
> Passing params across modules.
A model parameter like `featuresCol` can be set before training, so we still
need to resolve it after training:
1) If `lr.model.featuresCol` is set, use it.
2) If `lr.model.featuresCol` is not set, use `lr.featuresCol`, or keep the
default if `lr.featuresCol` is not set either.
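That resolution order can be written down as a small helper. This is a sketch with hypothetical names and a made-up default, not the actual implementation:

```java
import java.util.Optional;

// Sketch of the featuresCol resolution order described above; names are hypothetical.
class FeaturesColResolver {
    static final String DEFAULT_FEATURES_COL = "features";

    /**
     * Resolve which column the model reads features from:
     * 1) the model's own setting, if present;
     * 2) otherwise the estimator's setting, if present;
     * 3) otherwise the default.
     */
    static String resolve(Optional<String> modelFeaturesCol,
                          Optional<String> estimatorFeaturesCol) {
        return modelFeaturesCol.orElse(
            estimatorFeaturesCol.orElse(DEFAULT_FEATURES_COL));
    }
}
```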
> In addition to the SchemaRDD based transformer API, it would be good to
have a single Row transformation API too.
I agree. We need to know the schema in order to transform individual
instances, but a row object doesn't carry a reference to its schema. We can
add `schema` as a parameter that is required to transform a row. Btw, we will
keep the `predict` method that works on a single `Vector`.
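A single-row API with an explicit schema could look like the following toy sketch, where a row is just a list of values and the schema maps column names to positions. All names and the scoring logic are hypothetical:

```java
import java.util.List;

// Toy sketch of transforming a single row given its schema; not the spark.ml API.
class RowTransformSketch {
    /** Look up a value in a row by column name, using the schema for the position. */
    static Object get(List<String> schema, List<Object> row, String column) {
        int index = schema.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Column not found: " + column);
        }
        return row.get(index);
    }

    /** A "prediction" that reads the features column; a placeholder for real scoring. */
    static double transform(List<String> schema, List<Object> row, String featuresCol) {
        double[] features = (double[]) get(schema, row, featuresCol);
        double sum = 0.0;
        for (double v : features) sum += v;  // stand-in for the actual model logic
        return sum;
    }
}
```

The point of the sketch is only that the schema must travel alongside the row, since the row itself cannot resolve column names.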
> Other than the API, the biggest worry I have is in terms of memory
management.
We can call `unpersist` inside `StandardScaler` after computing the mean and
the standard deviation. It is not an issue here, but I get your point. We can
add pipeline components that persist/checkpoint input datasets; deciding when
to `unpersist` is always the hard part. I'll think about it.
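The persist-compute-unpersist pattern inside a fitting step could look like this toy sketch. There is no real Spark dependency here; `persist`/`unpersist` are simulated with a flag, and all names are hypothetical:

```java
import java.util.List;

// Toy stand-in for a cached dataset; persist/unpersist just flip a flag.
class CachedDataset {
    final List<Double> values;
    boolean persisted = false;

    CachedDataset(List<Double> values) { this.values = values; }
    void persist() { persisted = true; }
    void unpersist() { persisted = false; }
}

// Sketch of a stage that caches its input, computes summary statistics,
// and releases the cache as soon as the statistics are available.
class StandardScalerSketch {
    double mean;
    double std;

    StandardScalerSketch fit(CachedDataset data) {
        data.persist();  // avoid recomputing the input across the two passes below
        int n = data.values.size();
        double sum = 0.0;
        for (double v : data.values) sum += v;
        mean = sum / n;
        double sq = 0.0;
        for (double v : data.values) sq += (v - mean) * (v - mean);
        std = Math.sqrt(sq / n);
        data.unpersist();  // release the cache once the stats are computed
        return this;
    }
}
```

The stage itself knows exactly when its input is no longer needed, which is why unpersisting inside the stage works here; the harder question is who unpersists datasets shared across stages.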
> Also if you don't mind I'll try to port some of the image processing
pipelines we have to this class structure by forking your branch. I feel
that'll help me figure out if what features are easy to use etc.
That's great! I've found that getting the code to compile is really helpful
for seeing the trade-offs. But please understand that this is a WIP and I may
update the code.