Hi all,

We have a dataframe with 2.5 million records and 13 features. We want to fit a logistic regression on this data, but first we need to discretize each column into buckets using QuantileDiscretizer. This should improve the model by reducing the influence of outliers.
For small dataframes QuantileDiscretizer works perfectly (see the ml example: https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer), but for large dataframes it tends to split the column into only the values 0 and 1, even though the number of buckets is set to 5. Here is my code:

    val discretizer = new QuantileDiscretizer()
      .setInputCol("C4")
      .setOutputCol("C4_Q")
      .setNumBuckets(5)

    val result = discretizer.fit(df3).transform(df3)
    result.show()

I found the same problem reported here: https://issues.apache.org/jira/browse/SPARK-13444 . But there is no solution yet. Am I configuring the function in a bad way? Should I pre-process the data first (e.g. z-scores)? Can somebody help me deal with this?

Regards
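In case it helps narrow things down, this is a workaround I am considering but have not verified: computing the quantile splits myself and feeding them to Bucketizer instead of relying on QuantileDiscretizer's internal sampling. Note this assumes Spark 2.0+, where DataFrame.stat.approxQuantile exists (it is not available in 1.6); df3 and column C4 are the same as in my code above.

```scala
import org.apache.spark.ml.feature.Bucketizer

// Interior cut points for 5 buckets (20th, 40th, 60th, 80th percentiles).
val probs = Array(0.2, 0.4, 0.6, 0.8)

// approxQuantile(col, probabilities, relativeError) — Spark 2.0+ only.
// A small relativeError makes the quantiles more accurate on large data.
val cuts = df3.stat.approxQuantile("C4", probs, 0.001)

// Bucketizer needs the splits to cover the full range and to be
// strictly increasing, so pad with +/- infinity and drop duplicates.
val splits = (Double.NegativeInfinity +: cuts :+ Double.PositiveInfinity)
  .distinct.sorted

val bucketizer = new Bucketizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setSplits(splits)

val result = bucketizer.transform(df3)
result.show()
```

Would this be equivalent to what QuantileDiscretizer is supposed to do, or am I missing something about how it chooses its splits?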