Hi all,

We have a dataframe with 2.5 million records and 13 features. We want
to perform a logistic regression with this data, but first we need to
discretize each column into buckets using QuantileDiscretizer. This should
improve the model's performance by reducing the influence of outliers.

For small dataframes QuantileDiscretizer works perfectly (see the ml example:
https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
but for large dataframes it tends to split the column into only the values 0
and 1, even though the number of buckets is set to 5. Here is my
code:

val discretizer = new QuantileDiscretizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setNumBuckets(5)

val result = discretizer.fit(df3).transform(df3)
result.show()

I found the same problem reported here:
https://issues.apache.org/jira/browse/SPARK-13444 , but there is no
solution yet.
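One workaround I am considering (just a sketch, and it assumes Spark 2.0+, where DataFrame.stat.approxQuantile is available, which is newer than the 1.6 docs linked above) is to compute the quantile split points manually and pass them to a Bucketizer, bypassing QuantileDiscretizer's internal sampling:

```scala
import org.apache.spark.ml.feature.Bucketizer

// Compute approximate quantiles of C4 ourselves (interior cut points
// for 5 buckets) instead of relying on QuantileDiscretizer's sampling.
// df3 and column "C4" are from the example above.
val probabilities = Array(0.2, 0.4, 0.6, 0.8)
val quantiles = df3.stat.approxQuantile("C4", probabilities, 0.001)

// Bucketizer requires strictly increasing splits, so deduplicate the
// quantiles and pad with infinities to cover the full value range.
val splits =
  Double.NegativeInfinity +: quantiles.distinct.sorted :+ Double.PositiveInfinity

val bucketizer = new Bucketizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setSplits(splits)

val result = bucketizer.transform(df3)
result.show()
```

I have not verified whether this avoids the collapse into two buckets on the full 2.5 million rows, but it at least makes the split points explicit and inspectable.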

Am I configuring the function incorrectly? Should I pre-process the
data first (e.g., z-scores)? Can somebody help me with this?

Regards
