Any help on the below?
On 19-Jan-2018 7:12 PM, "Aakash Basu" <[email protected]> wrote:
> Hi all,
>
> I am new to ML APIs and am trying to get the ROC curve for model
> evaluation in both scikit-learn and PySpark MLlib. I cannot find any
> API that computes the ROC curve for binary classification in Spark MLlib.
>
> The code below belongs to a wrapper that creates the respective
> DataFrame from the source data; the DataFrame has two columns, as attached.
>
> I want to get the same ROC curve in Spark that the Python code produces.
> Is there any API on the MLlib side that does this?
>
> Python sklearn Code -
>
> # Assumes: import numpy as np; from sklearn import metrics
> def roc(self, y_true, y_pred):
>     df_a = self._df.copy()
>     # Drop NaNs and cast labels and predictions to int.
>     # Note: each column is filtered independently, so the NaNs must sit
>     # in the same rows of both columns for the pairs to stay aligned.
>     values_1_tmp = df_a[y_true].values
>     values_1 = values_1_tmp[~np.isnan(values_1_tmp)].astype(int)
>     values_2_tmp = df_a[y_pred].values
>     values_2 = values_2_tmp[~np.isnan(values_2_tmp)].astype(int)
>     # roc_curve returns (fpr, tpr, thresholds): fpr is 1 - specificity,
>     # tpr is the sensitivity
>     fpr, tpr, thresholds = metrics.roc_curve(values_1, values_2, pos_label=2)
>     # area_under_roc = metrics.roc_auc_score(values_1, values_2)
>     print(tpr, fpr)
>     return tpr, fpr
>
> Result:
>
> [ 0. 0.34138342 0.67412045 1. ]
> [ 0. 0.33373458 0.67378875 1. ]
>
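For reference, the two arrays printed above are the coordinates of the curve: `roc_curve` returns `(fpr, tpr, thresholds)`, so with the print order used in the code the first array is the true-positive rate and the second is the false-positive rate. Below is a minimal pure-Python sketch of how such points come about. The function name `roc_points` and the toy data are made up for illustration, and unlike scikit-learn it does not prune any intermediate points.

```python
def roc_points(y_true, y_score, pos_label=1):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    # Sweep thresholds from the highest score down to the lowest
    pairs = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, label in pairs if label == pos_label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == pos_label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples tied at this score are counted
        last_of_tie = i == len(pairs) - 1 or pairs[i + 1][0] != score
        if last_of_tie:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


# Toy data (hypothetical): labels are 1/2, with 2 as the positive class
fpr, tpr = roc_points([1, 1, 2, 2], [0.1, 0.4, 0.35, 0.8], pos_label=2)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

Because the predictions in the post are cast to int, there are only a few distinct thresholds, which would explain why the result has just four points.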
>
> PySpark Code -
>
> # Assumes: from pyspark.mllib.evaluation import BinaryClassificationMetrics
> def roc(self, y_true, y_pred):
>     print('using pyspark df')
>     df_a = self._df
>     # BinaryClassificationMetrics expects an RDD of (score, label) pairs,
>     # so select the prediction column first and the label column second
>     rows = df_a[y_pred, y_true].toPandas().values
>     double_list = [[float(item) for item in row] for row in rows]
>
>     new_rdd = self._sc.parallelize(double_list)
>     metrics = BinaryClassificationMetrics(new_rdd)
>     # areaUnderROC is a single float, not the curve itself
>     roc_calc = metrics.areaUnderROC
>     print(roc_calc)
>     print(type(roc_calc))
>     return 1
>
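On the Spark side, `areaUnderROC` is a single number (the area under the curve), not the curve. As far as I know, the Python `BinaryClassificationMetrics` wrapper exposes only `areaUnderROC` and `areaUnderPR`, while the Scala class also has a `roc()` method; please check this against the docs for your Spark version. To show how the two outputs relate, here is a small sketch that integrates ROC points with the trapezoid rule. The helper name `auc_trapezoid` is hypothetical, and the points are the ones printed by the sklearn version above.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given as matching fpr/tpr lists."""
    if len(fpr) != len(tpr) or len(fpr) < 2:
        raise ValueError("fpr and tpr must have the same length (>= 2)")
    # Sort by fpr (then tpr) so the curve is traversed left to right
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Trapezoid between two neighbouring points
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Points taken from the sklearn result quoted in this thread
fpr = [0.0, 0.33373458, 0.67378875, 1.0]
tpr = [0.0, 0.34138342, 0.67412045, 1.0]
print(round(auc_trapezoid(fpr, tpr), 4))
```

For these points the area comes out at roughly 0.50. If the Spark and sklearn inputs really are the same, `areaUnderROC` should land in the same neighbourhood, although Spark bins thresholds in its own way, so an exact match is not guaranteed.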
>
> Please help.
>
> Thanks,
> Aakash.
>