Re: Opening a JIRA for QuantileDiscretizer bug

Ted Yu Mon, 22 Feb 2016 19:08:38 -0800

When you click on Create, you're brought to the 'Create Issue' dialog, where
you choose Spark as the Project.
The Component should be MLlib.


Please see also:
http://search-hadoop.com/m/q3RTtmsshe1W6cH22/spark+pull+template&subj=pull+request+template


On Mon, Feb 22, 2016 at 6:45 PM, Pierson, Oliver C <o...@gatech.edu> wrote:

> Hello,
>
>   I've discovered a bug in the QuantileDiscretizer estimator.
> Specifically, for large DataFrames QuantileDiscretizer will only create one
> split (i.e. two bins).
>
>
> The error happens in lines 113 and 114 of QuantileDiscretizer.scala:
>
>
>     val requiredSamples = math.max(numBins * numBins, 10000)
>
>     val fraction = math.min(requiredSamples / dataset.count(), 1.0)
>
>
> After the first line, requiredSamples is an Int, so the division in the
> second line is integer division.  Therefore, if requiredSamples <
> dataset.count() then fraction is always 0.0.
>
>
> The problem can be simply fixed by replacing the first line with:
>
>
>   val requiredSamples = math.max(numBins * numBins, 10000.0)
>
>
> I've implemented this change in my fork and all tests passed (except for
> docker integration, but I think that's another issue).  I'm happy to submit
> a PR if it will ease someone else's workload.  However, I'm unsure of how
> to create a JIRA.  I've created an account on the issue tracker (
> issues.apache.org) but when I try to create an issue it asks me to choose
> a "Service Desk".  Which one should I be choosing?
>
>
> Thanks much,
>
> Oliver Pierson
>
>
>
>
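
The truncation described in the quoted message can be reproduced without
Spark. The sketch below is only an illustration, not the actual patch: the
object name FractionSketch, the helpers buggyFraction and fixedFraction, and
the count parameter (standing in for dataset.count(), which returns a Long)
are hypothetical names introduced here.

```scala
object FractionSketch {
  // Mirrors the two quoted lines; `count` is a stand-in for dataset.count().
  def buggyFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    // Int / Long is integer division: the quotient is truncated to a Long
    // before it is widened to Double for math.min.
    math.min(requiredSamples / count, 1.0)
  }

  // Same computation with the proposed change: 10000.0 makes requiredSamples a Double.
  def fixedFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0) // floating-point division
  }

  def main(args: Array[String]): Unit = {
    val count = 1000000L // a "large" dataset: more rows than requiredSamples
    println(buggyFraction(10, count)) // prints 0.0
    println(fixedFraction(10, count)) // prints 0.01
  }
}
```

With a row count below requiredSamples, both versions return 1.0, which is
why the problem only shows up on large DataFrames.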
