Hi everyone!
I am currently digging into MLlib in Spark 1.2.1. While reading the code of
mllib.stat.test, in the file ChiSqTest.scala under
/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused
by the usage of the mapPartitions API in the function
def chiSquaredFeatures(data: RDD[LabeledPoint],
      methodName: String = PEARSON.name): Array[ChiSqTestResult]

As far as I understand, the chi-squared test requires large expected counts
(> 5 in at least 80% of the cells) in its contingency matrix for the
chi-squared approximation to be reasonable
(http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). Therefore the
number of feature and label categories cannot be too large; otherwise there
would be too few items in each cell, which violates this constraint on the
use of the chi-squared test.
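
To make the constraint concrete, here is a rough sketch in plain Scala (my
own illustration, not the MLlib implementation) of Pearson's statistic and
the 80% rule of thumb:

// Sketch only: Pearson's chi-squared statistic for an r x c contingency
// matrix, plus the rule of thumb that at least 80% of the expected cell
// counts should exceed 5 for the approximation to be trustworthy.
def pearsonChiSq(counts: Array[Array[Double]]): (Double, Boolean) = {
  val rowSums = counts.map(_.sum)
  val colSums = counts.transpose.map(_.sum)
  val total   = rowSums.sum
  var stat = 0.0
  var largeCells = 0
  var cells = 0
  for (i <- counts.indices; j <- counts(i).indices) {
    val expected = rowSums(i) * colSums(j) / total
    stat += (counts(i)(j) - expected) * (counts(i)(j) - expected) / expected
    cells += 1
    if (expected > 5) largeCells += 1
  }
  // (statistic, whether the >5-in-80%-of-cells rule of thumb is satisfied)
  (stat, largeCells.toDouble / cells >= 0.8)
}

The more categories there are, the more cells the same total count is spread
over, so the expected counts shrink and the second component turns false.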

I do see that, in the function above, Spark throws an exception when
distinctLabels.size or distinctFeatures.size exceeds maxCategories (defined
as 10000). However, the two HashSets distinctLabels and distinctFeatures are
initialized inside mapPartitions, which means the check is only sensitive to
the number of feature and label categories within a single partition. As a
result, the reduced result (the contingency matrix) can still have more
categories than allowed, and hence small cell counts, which makes the
chi-squared approximation inaccurate. I have written a unit test on this
function that demonstrates this case.
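
To illustrate the pattern I am worried about, here is a simplified sketch
(my own code, not the actual ChiSqTest implementation; the function name and
the small limit are just for illustration). Each partition keeps its own
HashSet, so no partition ever exceeds the limit, yet the merged counts
contain far more categories:

import scala.collection.mutable

import org.apache.spark.{SparkContext, SparkException}

// Simplified illustration of a per-partition cardinality check.
def perPartitionCheckExample(sc: SparkContext): Unit = {
  val maxCategories = 4 // tiny limit, just for illustration

  // 16 distinct "labels" overall, but only 4 distinct labels per partition.
  val data = sc.parallelize(0 until 16, 4)

  val counts = data.mapPartitions { iter =>
    // This set only sees the records of one partition, so the check below
    // never fires even though the global number of categories is 16.
    val distinctLabels = mutable.HashSet.empty[Int]
    iter.map { label =>
      distinctLabels += label
      if (distinctLabels.size > maxCategories) {
        throw new SparkException(s"More than $maxCategories categories in a partition")
      }
      label
    }
  }.countByValue()

  println(counts.size) // 16 categories survive into the reduced result
}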

Maybe I am simply misunderstanding something. Could anyone please give me a
hint on this issue?



-----
Feel the sparking Spark!
