[ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18704:
-------------------------------
    Description: 
Currently CrossValidator trains (k-fold * paramMaps) different models 
during the training process, yet it passes only the average metrics to 
CrossValidatorModel. As a result, important information such as the variance 
of the metric across folds for the same paramMap cannot be retrieved, and 
users cannot tell whether the chosen k is appropriate. Since CrossValidator is 
relatively expensive, we probably want to get the most from the tuning process.
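
To illustrate what is lost by averaging, here is a minimal sketch in plain 
Python (not Spark code). It assumes a hypothetical metrics matrix in which 
fold_metrics[i][j] holds the metric for fold i and paramMap j; the function 
name and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def summarize_folds(fold_metrics):
    """Return per-paramMap mean and sample standard deviation.

    fold_metrics[i][j] is the metric for fold i and paramMap j
    (a hypothetical layout, not an existing Spark structure).
    """
    num_param_maps = len(fold_metrics[0])
    summary = []
    for j in range(num_param_maps):
        # Collect the metric of paramMap j across all k folds.
        column = [row[j] for row in fold_metrics]
        summary.append({"mean": mean(column), "std": stdev(column)})
    return summary


# Made-up metrics for k = 3 folds and 2 paramMaps.
fold_metrics = [
    [9.0, 9.0],
    [10.0, 7.0],
    [11.0, 11.0],
]
summary = summarize_folds(fold_metrics)
for j, stats in enumerate(summary):
    print(j, stats["mean"], stats["std"])
```

With only the averages, the two paramMaps would be reported as 10.0 and 9.0; 
the spread across folds, which is larger for the second one, is not visible.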

Just want to see if this sounds good. In my opinion, this can be done either by 
passing a metrics matrix to the CrossValidatorModel or by introducing a 
CrossValidatorSummary. I would vote for introducing a TuningSummary class, 
which can also be used by TrainValidationSplit. In the summary we can present 
better statistics for the tuning process, something like a DataFrame:
+---------------+------------+--------+-----------------+
|elasticNetParam|fitIntercept|regParam|metrics          |
+---------------+------------+--------+-----------------+
|0.0            |true        |0.1     |9.747795248932505|
|0.0            |true        |0.01    |9.751942357398603|
|0.0            |false       |0.1     |9.71727627087487 |
|0.0            |false       |0.01    |9.721149803723822|
|0.5            |true        |0.1     |9.719358515436005|
|0.5            |true        |0.01    |9.748121645368501|
|0.5            |false       |0.1     |9.687771328829479|
|0.5            |false       |0.01    |9.717304811419261|
|1.0            |true        |0.1     |9.696769467196487|
|1.0            |true        |0.01    |9.744325276259957|
|1.0            |false       |0.1     |9.665822167122172|
|1.0            |false       |0.01    |9.713484065511892|
+---------------+------------+--------+-----------------+

Using the DataFrame, users can better understand the effect of different 
parameters.
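
A rough sketch of how such a table could be assembled, in plain Python rather 
than Spark. The helper tuning_rows is hypothetical, and it assumes the metrics 
are listed in the same order in which the grid is expanded, which an actual 
implementation would need to guarantee.

```python
from itertools import product


def tuning_rows(grid, metrics):
    """Pair every parameter combination with its metric (hypothetical helper).

    grid maps a parameter name to its candidate values; metrics must follow
    the expansion order of itertools.product over those values.
    """
    names = list(grid)
    combos = list(product(*(grid[name] for name in names)))
    if len(combos) != len(metrics):
        raise ValueError("expected one metric per parameter combination")
    return [dict(zip(names, combo), metrics=m) for combo, m in zip(combos, metrics)]


# Parameter values and metrics copied from the table in the description.
grid = {
    "elasticNetParam": [0.0, 0.5, 1.0],
    "fitIntercept": [True, False],
    "regParam": [0.1, 0.01],
}
metrics = [
    9.747795248932505, 9.751942357398603, 9.71727627087487, 9.721149803723822,
    9.719358515436005, 9.748121645368501, 9.687771328829479, 9.717304811419261,
    9.696769467196487, 9.744325276259957, 9.665822167122172, 9.713484065511892,
]
rows = tuning_rows(grid, metrics)
for row in rows:
    print(row)
```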

Another thing we should improve is to include the paramMaps in the 
CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful 
serialization. Keeping only the metrics without the corresponding paramMaps 
does not really help model reuse.
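
As a sketch of the idea only, the following plain Python stores each paramMap 
next to its metric and round-trips the result through JSON. The function names 
and the record layout are assumptions for illustration and do not reflect how 
Spark persists models.

```python
import json


def dump_tuning_results(param_maps, metrics):
    """Serialize paramMaps together with their metrics (hypothetical format)."""
    if len(param_maps) != len(metrics):
        raise ValueError("expected one metric per paramMap")
    records = [{"params": p, "metric": m} for p, m in zip(param_maps, metrics)]
    return json.dumps(records, sort_keys=True)


def load_tuning_results(text):
    """Restore the list of {params, metric} records."""
    return json.loads(text)


# First two rows of the table in the description.
param_maps = [
    {"elasticNetParam": 0.0, "fitIntercept": True, "regParam": 0.1},
    {"elasticNetParam": 0.0, "fitIntercept": True, "regParam": 0.01},
]
metrics = [9.747795248932505, 9.751942357398603]

restored = load_tuning_results(dump_tuning_results(param_maps, metrics))
for record in restored:
    print(record["params"], record["metric"])
```

Because every metric stays attached to the parameters that produced it, the 
restored records remain interpretable without the original CrossValidator.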


> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
>                 Key: SPARK-18704
>                 URL: https://issues.apache.org/jira/browse/SPARK-18704
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor


