Hi everyone! I am currently digging into MLlib in Spark 1.2.1. While reading the code of mllib.stat.test, in the file ChiSqTest.scala under /spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused by the use of the mapPartitions API in the function def chiSquaredFeatures(data: RDD[LabeledPoint], methodName: String = PEARSON.name): Array[ChiSqTestResult].
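For context, this is roughly how that code path is reached through the public API. As far as I can tell, Statistics.chiSqTest on an RDD[LabeledPoint] delegates to chiSquaredFeatures; the local SparkContext setup and the tiny data set below are only my own illustration, not taken from the MLlib sources:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

object ChiSqFeaturesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "chi-sq-example")

    // Categorical features and labels encoded as doubles; chiSquaredFeatures
    // builds one contingency table per feature (feature value vs. label).
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(0.0, Vectors.dense(0.0, 0.0))
    ), 2)

    // The public entry point that ends up in chiSquaredFeatures.
    val results = Statistics.chiSqTest(data)
    results.zipWithIndex.foreach { case (r, i) => println(s"feature $i: $r") }

    sc.stop()
  }
}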
According to my knowledge of statistical testing, Pearson's chi-squared test needs reasonably large expected counts in its contingency table (roughly, greater than 5 in at least 80% of the cells) for the chi-squared approximation to be good (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). The number of feature and label categories therefore cannot be too large; otherwise there are too few items per cell, which violates the conditions for applying the test.

I do see that in the function above Spark throws an exception when distinctLabels.size or distinctFeatures.size exceeds maxCategories, defined as 10000. However, the two HashSets distinctLabels and distinctFeatures are initialized inside mapPartitions, which means the check only sees the number of feature and label categories within a single partition. The reduced result, i.e. the contingency matrix, can still end up with more categories than the limit, and therefore with small cell counts that make the chi-squared approximation inaccurate. I have written a unit test against this function that demonstrates the case; a small self-contained sketch of the behavior I mean is at the end of this message.

Maybe I am just trapped by a misunderstanding. Could anyone please give me a hint on this issue?

-----
Feel the sparking Spark!
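To make the concern concrete, here is the sketch mentioned above. It is not the actual MLlib code: I shrank maxCategories to 8 and the data is synthetic. It shows that a HashSet created inside mapPartitions only ever sees one partition's values, so a per-partition size check can pass even though the global number of distinct values exceeds the limit:

import scala.collection.mutable
import org.apache.spark.SparkContext

object PerPartitionDistinctSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "per-partition-distinct")

    val maxCategories = 8  // stand-in for ChiSqTest's maxCategories = 10000

    // 32 distinct "label" values spread evenly over 4 partitions, so each
    // partition only ever sees 8 distinct values.
    val labels = sc.parallelize(0 until 32, 4).map(_.toDouble)

    val perPartitionMax = labels.mapPartitions { iter =>
      val distinct = mutable.HashSet.empty[Double]  // created once per partition
      iter.foreach(distinct += _)
      // The analogous check in ChiSqTest.scala would throw here if the
      // per-partition count exceeded maxCategories, which it never does here.
      Iterator(distinct.size)
    }.max()

    val globalDistinct = labels.distinct().count()

    println(s"per-partition distinct (max) = $perPartitionMax, limit = $maxCategories")
    println(s"global distinct              = $globalDistinct")

    sc.stop()
  }
}

With four partitions, each partition reports 8 distinct values and the per-partition check passes, while the global distinct count is 32. This is the same pattern my unit test against chiSquaredFeatures runs into.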