[ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721523#comment-15721523
 ] 

yuhao yang commented on SPARK-18704:
------------------------------------

Glad to have your attention. In 
https://github.com/hhbyyh/spark/blob/tuningsummary/mllib/src/main/scala/org/apache/spark/ml/tuning/TuningSummary.scala#L40,
I have an implementation of TuningSummary, though it is really only for 
TrainValidationSplit. If you'd like, feel free to work on CrossValidator.

> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
>                 Key: SPARK-18704
>                 URL: https://issues.apache.org/jira/browse/SPARK-18704
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Currently CrossValidator trains (k folds * paramMaps) different models 
> during the training process, yet it only passes the average metrics to 
> CrossValidatorModel. As a result, some important information, such as the 
> variance of the metric for the same paramMap, cannot be retrieved, and users 
> cannot tell whether the chosen k is appropriate. Since CrossValidator is 
> relatively expensive, we probably want to get the most out of the tuning 
> process.
> Just want to see if this sounds good. In my opinion, this can be done either 
> by passing a metrics matrix to the CrossValidatorModel or by introducing a 
> CrossValidatorSummary. I would vote for introducing a TuningSummary class, 
> which can also be used by TrainValidationSplit. In the summary we can present 
> better statistics for the tuning process, something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different 
> parameters.
> Another thing we should improve is to include the paramMaps in the 
> CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful 
> serialization. Keeping only the metrics without ParamMaps does not really 
> help model reuse.
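
To make the "metrics matrix" idea concrete, here is a minimal sketch in plain Python (no Spark involved). The function name summarize_folds, the row layout, and the sample numbers are all hypothetical and only for illustration; this is not the TuningSummary implementation linked above. It assumes a matrix where fold_metrics[i][j] is the metric of paramMap j on fold i, and it keeps the per-paramMap variance that averaging alone would discard.

```python
from statistics import mean, pvariance


def summarize_folds(param_maps, fold_metrics):
    """Build one summary row per paramMap from a (folds x paramMaps) matrix.

    fold_metrics[i][j] is assumed to hold the metric of param_maps[j] on fold i.
    """
    rows = []
    for j, params in enumerate(param_maps):
        # Collect the metric of this paramMap across all folds.
        scores = [fold[j] for fold in fold_metrics]
        rows.append({**params, "mean": mean(scores), "variance": pvariance(scores)})
    return rows


# Made-up data: 3 folds, 2 paramMaps (not taken from any real run).
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
fold_metrics = [
    [1.0, 2.0],
    [3.0, 2.0],
    [2.0, 2.0],
]

summary = summarize_folds(param_maps, fold_metrics)
for row in summary:
    print(row)
```

In this made-up data both paramMaps have the same average metric, but the first one varies across folds while the second does not, which is exactly the kind of information that is lost when only the averages are kept.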



