[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

Peng Meng (JIRA) Tue, 11 Oct 2016 04:45:38 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565225#comment-15565225
 ]


Peng Meng commented on SPARK-17870:
-----------------------------------

Yes, SelectKBest and SelectPercentile in scikit-learn rank features by the 
statistic alone. scikit-learn can do that because it computes the chi-square 
value differently: the degrees of freedom (DoF) are the same for every 
feature, so the statistics are directly comparable.

The chi-square value is computed as follows. Suppose we have the data:

X = [ 8  7  0
      0  9  6
      0  9  8
      8  9  5 ]
y = [ 0  1  1  2 ]^T

This is the test data from ml/feature/ChiSqSelectorSuite.scala.

scikit-learn first one-hot encodes y:

Y = [ 1  0  0
      0  1  0
      0  1  0
      0  0  1 ]

Then it sums each feature per class:

observed = Y' * X =
[ 8   7   0
  0  18  14
  8   9   5 ]

The expected counts are the class frequencies (0.25, 0.5, 0.25) times the 
column sums of X (16, 34, 19):

expected =
[ 4   8.5   4.75
  8  17     9.5
  4   8.5   4.75 ]

_chisquare(observed, expected) then computes the chi-square value of all 
features at once. Every feature is compared against the same set of classes, 
so the DoF is the same for each feature (number of classes minus one, here 2).
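
The computation above can be reproduced with a minimal sketch in plain Python. It is an illustration of the steps described in this comment, not scikit-learn's actual code; the variable names are chosen here for readability.

```python
# Data from the example above: rows are samples, columns are features.
X = [[8, 7, 0],
     [0, 9, 6],
     [0, 9, 8],
     [8, 9, 5]]
y = [0, 1, 1, 2]

n_classes = max(y) + 1
n_features = len(X[0])

# observed[c][j] = sum of feature j over the samples of class c (Y' * X).
observed = [[0.0] * n_features for _ in range(n_classes)]
for row, label in zip(X, y):
    for j, value in enumerate(row):
        observed[label][j] += value

# expected[c][j] = P(class c) * (column sum of feature j).
class_prob = [y.count(c) / len(y) for c in range(n_classes)]
feature_sum = [sum(row[j] for row in X) for j in range(n_features)]
expected = [[p * s for s in feature_sum] for p in class_prob]

# Pearson chi-square statistic per feature, summed over the classes.
chi2 = [
    sum((observed[c][j] - expected[c][j]) ** 2 / expected[c][j]
        for c in range(n_classes))
    for j in range(n_features)
]

# Every feature shares the same degrees of freedom: n_classes - 1.
dof = n_classes - 1

print(chi2, dof)
```

Because all three statistics share the same DoF, sorting by the statistic and sorting by the p-value give the same order.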

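The bug in Spark's Statistics.chiSqTest(RDD) is that it uses another method: 
for each feature, it constructs a contingency table of feature values against 
labels. The DoF then depends on the number of distinct values of the feature, 
so it is different for each feature.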
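
To illustrate why the DoF differs per feature with the contingency-table approach, here is a small sketch in plain Python. It assumes the usual Pearson independence-test rule, DoF = (rows - 1) * (columns - 1), with one row per distinct feature value and one column per distinct label; the helper names are hypothetical and this is not Spark's code.

```python
from collections import Counter

# Same example data: rows are samples, columns are features.
X = [[8, 7, 0],
     [0, 9, 6],
     [0, 9, 8],
     [8, 9, 5]]
y = [0, 1, 1, 2]


def contingency_table(column, labels):
    """Count co-occurrences of (feature value, label) pairs."""
    return Counter(zip(column, labels))


def contingency_dof(column, labels):
    """DoF of an independence test on a values-by-labels table."""
    n_values = len(set(column))
    n_labels = len(set(labels))
    return (n_values - 1) * (n_labels - 1)


columns = [[row[j] for row in X] for j in range(len(X[0]))]
tables = [contingency_table(col, y) for col in columns]
dofs = [contingency_dof(col, y) for col in columns]

print(dofs)
```

On this data the third feature has four distinct values, so its table is larger and its DoF is higher than that of the first two features, which have two distinct values each.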

As for "gives different results from ranking on the statistic": this is 
because the parameters are different. For the previous example, 
SelectKBest(2) selects the same features as SelectFpr(0.2) in scikit-learn.
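
A rough sketch of that comparison in plain Python follows. The statistics are the per-feature chi-square values worked out by hand from the observed and expected tables in this comment. The two selection steps are simplified stand-ins for SelectKBest(2) and SelectFpr(0.2), not the scikit-learn implementation. It relies on the fact that for 2 degrees of freedom the chi-square survival function is exp(-x / 2).

```python
import math

# Chi-square statistics of the three features, derived by hand from the
# observed/expected tables of the example (all with 2 degrees of freedom).
chi2 = [16.0, 6.0 / 17.0, 131.0 / 19.0]

# For 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
p_values = [math.exp(-x / 2.0) for x in chi2]

# Stand-in for SelectKBest(2): keep the 2 features with the largest statistic.
k = 2
ranked = sorted(range(len(chi2)), key=lambda j: chi2[j], reverse=True)
k_best = sorted(ranked[:k])

# Stand-in for SelectFpr(0.2): keep the features whose p-value is below alpha.
alpha = 0.2
fpr = [j for j, p in enumerate(p_values) if p < alpha]

print(k_best, fpr)
```

The indices are 0-based here, so index 0 is the first feature of X and index 2 is the third.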


         


> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> ---------------------------------------------
>
>                 Key: SPARK-17870
>                 URL: https://issues.apache.org/jira/browse/SPARK-17870
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>            Reporter: Peng Meng
>            Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector selects features based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features 
> with the largest chi-square value. But the degrees of freedom (df) of the 
> chi-square value differ between features in Statistics.chiSqTest(RDD), and 
> chi-square values with different df cannot be compared to select features.
> Because of the wrong method for computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

